
Content-Based Filtering - Recommending by What Items Are Made Of

Reading time: ~30 minutes | Level: Recommender Systems | Role: MLE, Data Scientist, AI Engineer

The Real Interview Moment

The interviewer leans back and says: "Imagine you are designing Spotify's recommendation engine. A brand new user signs up - no listening history, no follows, no playlists. How do you recommend music?"

Most candidates freeze. They have just spent two days reading about collaborative filtering and now the first question destroys its core assumption - you need user history to find similar users. With zero history, collaborative filtering is completely blind. There is nothing to compare. No neighbors to find. No signal to leverage.

This is the cold-start problem in its purest form, and it is where content-based filtering earns its place.

You pull yourself together and answer: "Spotify knows the audio features of every track - tempo, energy, danceability, key, loudness, valence, acousticness. Even without listening history, we can ask the user to pick a few artists or songs they love, extract the feature vectors of those tracks, build a user profile from the average, and immediately recommend other tracks with similar audio fingerprints." The interviewer nods. You have just described content-based filtering from first principles.

Content-based filtering is one of the oldest and most robust approaches in the recommender systems toolkit. Unlike collaborative filtering, it requires no information about other users at all. It only needs to know two things: what are the features of each item, and which items has this particular user engaged with? From those two inputs, it can produce recommendations for any user - brand new or long-tenured.

Understanding content-based filtering deeply - its math, its implementation, its failure modes, and its relationship to modern neural approaches - is table stakes for any ML engineer working on personalization, search, or discovery systems.


Why This Approach Exists

Before content-based filtering, the dominant paradigm for recommendations was editorial curation - humans decided what to surface. In the 1990s, as digital catalogs grew into the hundreds of thousands of items, human curation became impossible. The first automated systems used demographic matching: recommend what people in your age group or zip code like. That was barely better.

Collaborative filtering (CF) was a breakthrough, but it carried a fundamental flaw: it required population-level signal. A new user had no signal. A new item had no signal. And catalog items with sparse interactions were systematically under-recommended even if they were genuinely good fits.

Content-based filtering took a completely different bet: instead of learning from crowds, learn from the items themselves. If we can describe each item precisely - its genre, its tempo, its author, its topics - and we know which items a user has already engaged with positively, we can find new items that are similar in feature space, not in user space.

This approach grew out of information retrieval before it migrated to recommendation. Systems like Syskill and Webert (1996) and Fab (1997) demonstrated that representing documents by their word content and user profiles as weighted word vectors enabled surprisingly good personalized recommendations. Syskill and Webert did this without any shared user data; Fab layered collaborative signals on top of its content-based profiles.


The Core Idea: Features Are Everything

Content-based filtering has three moving parts:

  1. Item representation - encode each item as a feature vector
  2. User profile - summarize what the user likes as a weighted combination of item vectors
  3. Scoring - rank unseen items by similarity to the user profile

The quality of recommendations is entirely determined by the quality of the feature representation. If your features capture what actually makes items appealing, recommendations will be good. If features are noisy or incomplete, no amount of mathematical sophistication will save you.

This is both the strength and the weakness of content-based filtering. The strength: you have full control over what features you encode. The weakness: you are entirely dependent on the quality of your metadata.


Item Feature Vectors

Text Items: TF-IDF

For text-heavy items - articles, books, movie descriptions, product listings - the standard approach is TF-IDF (Term Frequency–Inverse Document Frequency).

The intuition: words that appear frequently in a document but rarely across the corpus are the most discriminative. The word "vampire" appearing many times in a movie description tells you a lot. The word "the" appearing many times tells you nothing.

Term Frequency measures how often a term appears in a document:

$$\text{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total terms in } d}$$

Inverse Document Frequency penalizes terms that appear in many documents:

$$\text{IDF}(t) = \log\frac{N}{df(t)}$$

where $N$ is the total number of documents in the corpus and $df(t)$ is the number of documents containing term $t$.

TF-IDF combines them:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{df(t)}$$

Each document becomes a vector in vocabulary space, where each dimension corresponds to a term and its value is the TF-IDF weight. Common words across all documents get low weights. Distinctive words get high weights. The resulting vectors capture the "about-ness" of each document.

note

In practice, scikit-learn's TfidfVectorizer applies L2 normalization by default, so each document vector has unit length. This makes cosine similarity equivalent to dot product, which is computationally convenient.
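The equivalence is easy to check numerically. The sketch below uses NumPy only; the matrix `X` is made-up stand-in data for three TF-IDF rows, not real vectorizer output.

```python
import numpy as np

# Hypothetical stand-in for three TF-IDF rows (non-negative weights).
X = np.array([
    [0.0, 2.0, 1.0, 0.0],
    [1.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 2.0, 0.0],   # same direction as row 0, twice the length
])

# L2-normalize each row so every document vector has unit length.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook cosine similarity on the un-normalized vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# After normalization, one matrix product yields every pairwise cosine.
dot_products = X_unit @ X_unit.T
```

Entry `dot_products[i, j]` should match `cosine(X[i], X[j])` up to floating-point error, which is why normalized vectors let you replace per-pair cosine calls with a single matrix multiplication.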

Audio Items: Structured Features

For music, Spotify uses structured audio features extracted by their audio analysis pipeline:

| Feature | Range | Meaning |
|---|---|---|
| Tempo | 0–250 BPM | Speed of the track |
| Energy | 0.0–1.0 | Intensity and activity |
| Danceability | 0.0–1.0 | How suitable for dancing |
| Valence | 0.0–1.0 | Musical positivity |
| Acousticness | 0.0–1.0 | Confidence that the track is acoustic |
| Speechiness | 0.0–1.0 | Presence of spoken words |
| Loudness | -60 to 0 dB | Overall loudness |
| Key | 0–11 | Musical key (C, C#, D, ...) |

Each track becomes a vector in this 8+ dimensional space. No listening history required - the features come from the audio signal itself.
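One practical wrinkle: these features live on very different scales (tempo up to 250, loudness down to -60), so an unscaled vector is dominated by tempo and loudness. A minimal sketch of min-max scaling follows, assuming the ranges in the table; the example track values and the `scale_track` helper are hypothetical.

```python
import numpy as np

# Feature order and ranges follow the table above.
FEATURES = ["tempo", "energy", "danceability", "valence",
            "acousticness", "speechiness", "loudness", "key"]
LOW = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -60.0, 0.0])
HIGH = np.array([250.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 11.0])

def scale_track(raw: np.ndarray) -> np.ndarray:
    """Min-max scale one raw feature vector into [0, 1] per dimension."""
    return np.clip((raw - LOW) / (HIGH - LOW), 0.0, 1.0)

# Hypothetical track: 125 BPM, fairly energetic, -6 dB, key of D (2).
raw_track = np.array([125.0, 0.8, 0.7, 0.6, 0.1, 0.05, -6.0, 2.0])
scaled_track = scale_track(raw_track)
```

Treating key as a number on a line is a simplification, since musical key is categorical; a one-hot encoding is a common alternative when key matters for similarity.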

Visual Items: Embedding Models

For images and videos, raw pixel values make terrible features (too high-dimensional, not semantically meaningful). Instead, we pass items through a pretrained CNN or vision transformer and extract the embedding from the penultimate layer. These learned embeddings capture semantic content - a beach photo will be close to other beach photos in embedding space even if the pixels are completely different.


Building the User Profile

Once items are represented as vectors, we need to represent the user. The standard approach: the user profile is the weighted average of the feature vectors of items the user has interacted with positively.

$$\vec{u} = \frac{\sum_{i \in I^+} w_i \vec{v}_i}{\sum_{i \in I^+} w_i}$$

where $I^+$ is the set of items the user liked (clicked, rated highly, saved, played), $\vec{v}_i$ is the feature vector of item $i$, and $w_i$ is the interaction weight (rating value, or 1.0 for implicit signals).

The user profile $\vec{u}$ lives in the same feature space as the items. A user who mostly watches action movies will have a profile vector pointing in the direction of "action-ness" in the TF-IDF space. A user who listens to high-energy, high-tempo tracks will have a profile vector in that corner of the audio feature space.

tip

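For implicit feedback (plays, clicks), set $w_i = 1$ for all interactions. For explicit ratings on a 1–5 scale, consider using $w_i = r_i - \bar{r}_u$, where $\bar{r}_u$ is the user's mean rating, so that below-average ratings push the profile away from disliked features. With signed weights, extend the sum to every rated item (not just $I^+$) and normalize by $\sum_i |w_i|$, because mean-centered weights sum to zero.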
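A minimal sketch of the mean-centered variant is shown below. The function name `profile_from_ratings` and the two-feature toy items are illustrative assumptions. Note the normalization by the sum of absolute weights: mean-centered weights sum to zero, so dividing by their plain sum would fail.

```python
import numpy as np

def profile_from_ratings(ratings: dict, item_matrix: np.ndarray) -> np.ndarray:
    """Signed-weight user profile with w_i = r_i - (this user's mean rating)."""
    mean_rating = sum(ratings.values()) / len(ratings)
    profile = np.zeros(item_matrix.shape[1])
    abs_weight = 0.0
    for item_id, r in ratings.items():
        w = r - mean_rating            # negative for below-average ratings
        profile += w * item_matrix[item_id]
        abs_weight += abs(w)
    # Normalize by sum of |w|; the signed weights themselves sum to zero.
    return profile / abs_weight if abs_weight > 0 else profile

# Hypothetical 2-feature items: column 0 = "action", column 1 = "romance".
items = np.array([[1.0, 0.0],    # item 0: pure action
                  [0.0, 1.0]])   # item 1: pure romance
profile = profile_from_ratings({0: 5.0, 1: 1.0}, items)
```

Here the user rated the action item 5 and the romance item 1, so the profile points toward action and away from romance.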


Scoring and Ranking Unseen Items

With a user profile $\vec{u}$ and item vectors $\{\vec{v}_j\}$, we score unseen items using cosine similarity:

$$\text{sim}(\vec{u}, \vec{v}_j) = \frac{\vec{u} \cdot \vec{v}_j}{\|\vec{u}\| \|\vec{v}_j\|}$$

Cosine similarity measures the angle between two vectors. It ignores magnitude and focuses purely on direction - whether the user profile and the item point in the same direction in feature space.

Why cosine and not Euclidean distance? TF-IDF vectors are sparse and high-dimensional, and before normalization their magnitude depends on document length and term repetition rather than on topic. Two documents with the same content distribution can therefore sit far apart in Euclidean space simply because one has larger weights across the board. Cosine similarity removes this effect by dividing by the vector magnitudes.
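A tiny numerical illustration, using made-up raw term counts: the second document repeats the first one ten times over, so the topic mix is identical and only the length differs.

```python
import numpy as np

short_doc = np.array([2.0, 1.0, 0.0])   # raw term counts (hypothetical)
long_doc = 10 * short_doc               # same term mix, ten times longer

# Euclidean distance grows with the length difference.
euclid = float(np.linalg.norm(short_doc - long_doc))

# Cosine similarity depends only on direction, so it stays at 1.
cosine = float(
    short_doc @ long_doc
    / (np.linalg.norm(short_doc) * np.linalg.norm(long_doc))
)
```

The Euclidean distance is large even though the documents are "about" exactly the same thing, while the cosine similarity stays at 1 (up to floating-point error).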

We rank all unseen items $j \notin I^+$ by their cosine similarity score and return the top-$k$ as recommendations.
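The ranking step can be sketched in a few lines of NumPy. The helper name `top_k_unseen` and the toy catalog are assumptions for illustration; the use of `argpartition` is one reasonable choice to avoid sorting the whole catalog, not the only one.

```python
import numpy as np

def top_k_unseen(user_vec, item_matrix, seen, k):
    """Return (item_index, score) pairs for the k best unseen items."""
    # Cosine similarity between the profile and every item row.
    denom = np.linalg.norm(item_matrix, axis=1) * np.linalg.norm(user_vec)
    denom[denom == 0] = 1.0              # zero vectors score 0 instead of NaN
    scores = item_matrix @ user_vec / denom
    for i in seen:
        scores[i] = -np.inf              # never recommend items already in I+
    k = min(k, len(scores) - len(seen))
    if k <= 0:
        return []
    # argpartition isolates the k largest scores without a full sort.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]  # order only those k by score
    return [(int(i), float(scores[i])) for i in top]

# Hypothetical 2-feature catalog; the user has already seen item 0.
items = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
recs = top_k_unseen(np.array([1.0, 0.0]), items, seen={0}, k=2)
```

With this toy data, item 1 (nearly parallel to the profile) ranks ahead of item 3, and the seen item 0 is excluded even though it would score highest.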



NumPy From Scratch: Movie Plot Recommender

Let us build a content-based recommender from scratch using TF-IDF on movie plot summaries. We will implement every step manually so the mechanics are fully transparent.

import numpy as np
from collections import Counter

# ── Step 1: Small movie corpus ───────────────────────────────────────────────
movies = {
    0: {"title": "The Dark Knight", "plot": "batman joker gotham crime vigilante justice chaos"},
    1: {"title": "Inception", "plot": "dream heist subconscious layers architect thief mind"},
    2: {"title": "Interstellar", "plot": "space wormhole gravity time relativity astronaut planet"},
    3: {"title": "The Avengers", "plot": "superhero iron man thor hulk avengers alien battle"},
    4: {"title": "Gravity", "plot": "space astronaut debris orbit survival planet earth"},
    5: {"title": "Batman v Superman", "plot": "batman superman gotham justice fight hero alien"},
    6: {"title": "Doctor Strange", "plot": "sorcerer magic dimension mystic arts mind superhero"},
    7: {"title": "Contact", "plot": "alien signal space telescope astronomer planet message"},
}

# ── Step 2: Build vocabulary ─────────────────────────────────────────────────
def tokenize(text: str) -> list:
    return text.lower().split()

all_terms = sorted(set(
    term
    for item in movies.values()
    for term in tokenize(item["plot"])
))
vocab = {term: idx for idx, term in enumerate(all_terms)}
V = len(vocab)
N = len(movies)
print(f"Vocabulary size: {V}, Corpus size: {N}")

# ── Step 3: Compute TF-IDF matrix ────────────────────────────────────────────
def term_frequency(tokens: list, vocab: dict) -> np.ndarray:
    """Raw term frequency - count / total tokens."""
    counts = Counter(tokens)
    tf = np.zeros(len(vocab))
    total = len(tokens)
    for term, count in counts.items():
        if term in vocab:
            tf[vocab[term]] = count / total
    return tf

def document_frequency(corpus: dict, vocab: dict) -> np.ndarray:
    """Number of documents containing each term."""
    df = np.zeros(len(vocab))
    for item in corpus.values():
        tokens = set(tokenize(item["plot"]))
        for term in tokens:
            if term in vocab:
                df[vocab[term]] += 1
    return df

df = document_frequency(movies, vocab)
idf = np.log((N + 1) / (df + 1)) + 1  # smoothed IDF (sklearn convention)

# Build TF-IDF matrix: shape (N, V)
tfidf_matrix = np.zeros((N, V))
for idx, item in movies.items():
    tokens = tokenize(item["plot"])
    tf = term_frequency(tokens, vocab)
    tfidf_matrix[idx] = tf * idf

# L2 normalize each row so cosine similarity = dot product
norms = np.linalg.norm(tfidf_matrix, axis=1, keepdims=True)
norms[norms == 0] = 1  # avoid division by zero
tfidf_matrix_normalized = tfidf_matrix / norms

print("\nTF-IDF matrix shape:", tfidf_matrix_normalized.shape)

# ── Step 4: Build user profile ───────────────────────────────────────────────
# User liked movies 0 (Dark Knight) and 5 (Batman v Superman)
liked_movie_ids = [0, 5]
interaction_weights = {0: 1.0, 5: 0.8}  # could be ratings

def build_user_profile(
    liked_ids: list,
    weights: dict,
    item_matrix: np.ndarray
) -> np.ndarray:
    """Weighted average of liked item vectors."""
    total_weight = 0.0
    profile = np.zeros(item_matrix.shape[1])
    for item_id in liked_ids:
        w = weights.get(item_id, 1.0)
        profile += w * item_matrix[item_id]
        total_weight += w
    profile /= total_weight
    # Re-normalize
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile

user_profile = build_user_profile(
    liked_movie_ids,
    interaction_weights,
    tfidf_matrix_normalized
)
print(f"\nUser profile vector shape: {user_profile.shape}")

# ── Step 5: Score unseen items ────────────────────────────────────────────────
def cosine_similarity_manual(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two unit vectors = dot product."""
    return float(np.dot(a, b))

unseen_ids = [i for i in movies if i not in liked_movie_ids]
scores = {}
for movie_id in unseen_ids:
    score = cosine_similarity_manual(user_profile, tfidf_matrix_normalized[movie_id])
    scores[movie_id] = score

ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)

print("\n=== Top Recommendations ===")
for movie_id, score in ranked:
    print(f" {movies[movie_id]['title']:25s} score={score:.4f}")

Running this produces:

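Vocabulary size: 45, Corpus size: 8

TF-IDF matrix shape: (8, 45)

User profile vector shape: (45,)

=== Top Recommendations ===
 Contact                   score=0.0502
 The Avengers              score=0.0441
 Inception                 score=0.0000
 Interstellar              score=0.0000
 Gravity                   score=0.0000
 Doctor Strange            score=0.0000

The only term the two liked films share with any unseen film is "alien" (from the Batman v Superman plot), so Contact and The Avengers lead with weak scores and every other film scores exactly zero. Doctor Strange misses entirely because bag-of-words matching is exact: its token "superhero" does not match the profile's "hero". The terms that dominate the profile (batman, gotham, justice) appear in no unseen film at all. The mechanics are working as designed, but the result also shows how brittle exact-token matching is on a tiny vocabulary. Richer text, stemming, or semantic embeddings are needed before "liked Batman films" reliably turns into "gets superhero films".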


Practical Code: scikit-learn on Movie Content

In production, you use TfidfVectorizer which handles tokenization, stop word removal, n-grams, and normalization automatically.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# ── Simulated movie metadata (in reality: MovieLens + TMDB plots) ─────────────
movies_df = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "title": [
        "The Dark Knight", "Inception", "Interstellar", "The Avengers",
        "Gravity", "Batman v Superman", "Doctor Strange", "Contact",
        "Tenet", "Thor: Ragnarok"
    ],
    "genres": [
        "Action Crime Drama", "Action Sci-Fi Thriller", "Adventure Drama Sci-Fi",
        "Action Adventure Sci-Fi", "Drama Sci-Fi Thriller", "Action Adventure Fantasy",
        "Action Adventure Fantasy", "Drama Mystery Sci-Fi",
        "Action Sci-Fi Thriller", "Action Adventure Comedy"
    ],
    "overview": [
        "Batman faces the Joker, who plunges Gotham into anarchy and chaos",
        "A thief enters dreams to plant an idea in a CEO mind",
        "Astronauts travel through a wormhole to find a new home for humanity",
        "The Avengers unite to stop Loki and an alien army from enslaving Earth",
        "An astronaut fights for survival after debris destroys her shuttle in orbit",
        "Batman and Superman clash over philosophies before a greater threat emerges",
        "A surgeon becomes a sorcerer and defends Earth from mystical threats",
        "A scientist detects an alien signal and is chosen to make first contact",
        "A secret agent discovers a word that lets him manipulate the flow of time",
        "Thor must stop his sister from unleashing a prophesied apocalypse on Asgard",
    ]
})

# Combine genres + overview for richer features
movies_df["content"] = movies_df["genres"] + " " + movies_df["overview"]

# ── Fit TF-IDF ───────────────────────────────────────────────────────────────
tfidf = TfidfVectorizer(
    max_features=5000,      # top 5000 terms by corpus frequency
    stop_words="english",   # remove "the", "a", "is", etc.
    ngram_range=(1, 2),     # unigrams + bigrams
    min_df=1,               # keep terms appearing in at least 1 doc
    sublinear_tf=True,      # replace TF with 1 + log(TF) - dampens high frequencies
)
item_matrix = tfidf.fit_transform(movies_df["content"])  # sparse (n, V)
print(f"Item matrix shape: {item_matrix.shape}")

# ── User history + profile ───────────────────────────────────────────────────
# User interacted with movie_id 1 (Dark Knight) and 6 (Batman v Superman)
user_history = {1: 5.0, 6: 4.0}  # movie_id → rating

def build_user_profile_sparse(history: dict, movies_df: pd.DataFrame, item_matrix):
    """Build user profile from sparse TF-IDF vectors."""
    total_weight = 0.0
    profile = None

    for movie_id, rating in history.items():
        # Relies on movies_df having the default 0..n-1 index (label == row position)
        row_idx = movies_df[movies_df["movie_id"] == movie_id].index[0]
        vec = item_matrix[row_idx]  # sparse (1, V)
        weight = rating
        if profile is None:
            profile = weight * vec
        else:
            profile = profile + weight * vec
        total_weight += weight

    profile = profile / total_weight
    return profile  # sparse (1, V)

user_profile = build_user_profile_sparse(user_history, movies_df, item_matrix)

# ── Score unseen items ────────────────────────────────────────────────────────
seen_ids = set(user_history.keys())
unseen_mask = ~movies_df["movie_id"].isin(seen_ids)
unseen_df = movies_df[unseen_mask].copy()

unseen_indices = unseen_df.index.tolist()
unseen_matrix = item_matrix[unseen_indices]

# cosine_similarity works with sparse matrices directly
scores = cosine_similarity(user_profile, unseen_matrix).flatten()
unseen_df["score"] = scores
recommendations = unseen_df.sort_values("score", ascending=False)

print("\n=== Recommendations ===")
print(recommendations[["title", "genres", "score"]].to_string(index=False))

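Output: the script prints the item matrix shape followed by the eight unseen films ranked by score. Exact values depend on the scikit-learn version and its built-in stop-word list, so only the qualitative pattern is described here.

Doctor Strange should rank at or near the top: it shares the full genre string "Action Adventure Fantasy" with Batman v Superman, including the bigrams "action adventure" and "adventure fantasy". The Avengers and Thor: Ragnarok also benefit from the shared "action adventure" genres. The remaining films are separated mainly by which single genre tokens they share with the two liked films.

Note that no unseen film scores exactly zero here. Because genres are concatenated into the text, every film shares at least one genre token with the profile: Interstellar, Gravity, and Contact all carry "Drama", which also appears in The Dark Knight's genres. Their scores are low, not zero. Genre tokens are also doing most of the work: the overviews of the two liked films share almost no vocabulary with the other overviews.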


Adding Temporal Decay to User Profiles

A static weighted average ignores recency. A user who loved action movies three years ago but has been watching documentaries for the past six months should have a profile that reflects current tastes.

Add exponential decay based on interaction timestamp:

import numpy as np
from datetime import datetime, timedelta

def build_user_profile_with_decay(
    history: list,  # [{"movie_id": int, "rating": float, "timestamp": datetime}]
    movies_df,
    item_matrix,
    halflife_days: float = 90.0  # interactions decay to half weight after 90 days
):
    """User profile with exponential time decay."""
    now = datetime.now()
    total_weight = 0.0
    profile = None

    for interaction in history:
        movie_id = interaction["movie_id"]
        rating = interaction["rating"]
        ts = interaction["timestamp"]

        days_ago = (now - ts).days
        # Exponential decay: weight halves every halflife_days
        time_weight = np.exp(-np.log(2) * days_ago / halflife_days)
        effective_weight = rating * time_weight

        row_idx = movies_df[movies_df["movie_id"] == movie_id].index[0]
        vec = item_matrix[row_idx]

        if profile is None:
            profile = effective_weight * vec
        else:
            profile = profile + effective_weight * vec
        total_weight += effective_weight

    return profile / total_weight if total_weight > 0 else profile

# Example: user liked Dark Knight 2 years ago, Doctor Strange 2 weeks ago
history_with_time = [
    {"movie_id": 1, "rating": 5.0, "timestamp": datetime.now() - timedelta(days=730)},
    {"movie_id": 7, "rating": 4.5, "timestamp": datetime.now() - timedelta(days=14)},
]

profile_decayed = build_user_profile_with_decay(
    history_with_time, movies_df, item_matrix, halflife_days=90
)
# Recent Doctor Strange interaction will dominate the profile
# Dark Knight 730 days ago: weight ~ 2^(-730/90) * 5.0 ≈ 0.02
# Doctor Strange 14 days ago: weight ~ 2^(-14/90) * 4.5 ≈ 4.04

With halflife_days=90, an interaction from 730 days ago has effective weight $\approx 0.02$ - nearly negligible. The recent Doctor Strange interaction dominates, steering recommendations toward current interests.


Production Engineering Notes

Sentence Transformers for Richer Embeddings

TF-IDF treats text as a bag of words - it ignores word order and semantics. "Space journey" and "journey space" are identical. "Not good" and "good" overlap almost entirely despite opposite meanings, while "film" and "movie" share no tokens despite meaning the same thing.

Modern production systems use sentence transformers (BERT-based models) to generate dense semantic embeddings:

from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity as cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, runs on CPU

# Encode all movie overviews
overviews = movies_df["overview"].tolist()
embeddings = model.encode(overviews, batch_size=32, show_progress_bar=True)
# shape: (10, 384) - dense, semantic, order-aware

def build_profile_neural(history: dict, movies_df, embeddings: np.ndarray) -> np.ndarray:
    total_w = 0.0
    profile = np.zeros(embeddings.shape[1])
    for movie_id, rating in history.items():
        idx = movies_df[movies_df["movie_id"] == movie_id].index[0]
        profile += rating * embeddings[idx]
        total_w += rating
    profile /= total_w
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile

profile = build_profile_neural({1: 5.0, 6: 4.0}, movies_df, embeddings)
scores = cos_sim(profile.reshape(1, -1), embeddings).flatten()
# Rank and filter unseen items

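Dense embeddings from sentence transformers often outperform TF-IDF on semantic similarity tasks, particularly when relevant items share meaning but not vocabulary. The size of the gain varies by dataset, so benchmark both on your own catalog before committing to the heavier model.

At millions of items, brute-force cosine similarity becomes a bottleneck. Faiss (Facebook AI Similarity Search) enables approximate nearest neighbor search, trading a small amount of recall for query times that can drop to milliseconds or less: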

import faiss
import numpy as np

# embeddings: np.ndarray shape (N, D), float32 required
N, D = embeddings.shape
embeddings_f32 = np.ascontiguousarray(embeddings, dtype=np.float32)

# Normalize for inner product = cosine similarity
faiss.normalize_L2(embeddings_f32)

# ── Flat index (exact search, good up to ~1M items) ──────────────────────────
index_flat = faiss.IndexFlatIP(D)  # IP = inner product
index_flat.add(embeddings_f32)

# ── IVF index (approximate, scales to 100M+ items) ───────────────────────────
# k-means training needs at least nlist vectors (Faiss warns below ~39 per cell),
# so cap nlist for small catalogs such as the 10-movie demo above.
nlist = max(1, min(100, N // 39))  # number of Voronoi cells
quantizer = faiss.IndexFlatIP(D)
index_ivf = faiss.IndexIVFFlat(quantizer, D, nlist, faiss.METRIC_INNER_PRODUCT)
index_ivf.train(embeddings_f32)
index_ivf.add(embeddings_f32)
index_ivf.nprobe = min(10, nlist)  # cells to visit at query time (speed/recall tradeoff)

# ── Query ─────────────────────────────────────────────────────────────────────
user_vec = profile.reshape(1, -1).astype(np.float32)
faiss.normalize_L2(user_vec)
distances, indices = index_ivf.search(user_vec, k=min(10, N))
print("Top-10 recommendation indices:", indices[0])
print("Similarity scores:", distances[0])

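With a well-tuned index, Faiss can serve catalogs of 100M+ items at single-digit-millisecond latency, though the exact figures depend on index type, vector dimension, recall target, and hardware. Approximate nearest neighbor search of this kind underpins the retrieval stage at most large platforms: Faiss originated at Facebook, and Spotify has open-sourced its own ANN libraries (Annoy and, later, Voyager).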

How Spotify Uses Content-Based Features

Spotify's recommendation system (as described in their engineering blog and research papers) combines several content signals:

  1. Audio features extracted by their Echo Nest audio analysis pipeline: tempo, key, mode, loudness, energy, danceability, valence, acousticness, speechiness, instrumentalness, liveness.
  2. NLP on lyrics and metadata: artist biographies, track titles, playlist co-occurrence text, user-generated tags processed through BERT-style models.
  3. Cultural vectors: embeddings trained on playlist co-occurrence (similar to word2vec but for tracks - tracks that frequently appear in the same playlists are close in vector space).

Content-based signals carry most of the weight for new users and newly released tracks, including in features such as "Discover Weekly", after which collaborative signals take over as listening history accumulates. The handoff is gradual rather than a hard switch - the blend shifts toward CF as interaction data becomes rich enough for it to dominate.


Common Mistakes

danger

Over-relying on metadata quality. Garbage in, garbage out - and it is not always obvious the garbage has entered. A movie with an inaccurate plot summary, a track with wrong genre tags, or a product with a copy-pasted description from a different product will generate plausible-looking but wrong recommendations. Always audit your metadata pipeline. Track precision@10 and recall@10 on a held-out validation set. If metrics are stagnant despite model improvements, the bottleneck is usually feature quality, not the algorithm.

warning

Not decaying old interactions. A static weighted average of all historical interactions treats a movie watched five years ago the same as one watched last week. User tastes evolve. Without temporal decay, the user profile will lag behind current preferences, producing recommendations that feel stale. Use exponential decay with a half-life tuned to your domain: short (30 days) for fashion or news, longer (180 days) for movies or books.

danger

Ignoring negative signals. Most implementations only average the items the user liked. But if a user explicitly rated a movie 1 star or skipped a track after 5 seconds, that is strong information - the features of that item should push the user profile away, not be ignored. Subtract weighted negative item vectors from the profile, or use a separate "dislike profile" to filter out recommendations whose features overlap with disliked items.
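One way to sketch the "subtract negatives" idea is the Rocchio relevance-feedback formula from information retrieval, swapped in here as an illustration. The weights `beta` and `gamma` and the toy vectors are assumptions to be tuned, not recommended values.

```python
import numpy as np

def rocchio_profile(liked, disliked, beta=1.0, gamma=0.5):
    """Mean of liked vectors minus a down-weighted mean of disliked vectors."""
    dim = (liked if len(liked) else disliked).shape[1]
    pos = liked.mean(axis=0) if len(liked) else np.zeros(dim)
    neg = disliked.mean(axis=0) if len(disliked) else np.zeros(dim)
    profile = beta * pos - gamma * neg
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile

# Hypothetical 3-feature item vectors.
liked = np.array([[1.0, 0.0, 0.0],
                  [1.0, 1.0, 0.0]])
disliked = np.array([[0.0, 1.0, 1.0]])
profile = rocchio_profile(liked, disliked)
```

In this toy case the third feature appears only in the disliked item, so its profile weight goes negative, and items carrying that feature are pushed down at scoring time. Keeping `gamma` below `beta` is a common precaution so that a few dislikes do not overwhelm the positive signal.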

tip

Combine with CF in a hybrid system for best results. Content-based filtering excels for new users and long-tail items. Collaborative filtering excels for popular items with rich interaction history and naturally surfaces serendipitous recommendations. The best production systems blend both: use content-based as the primary signal for cold-start users and items, then gradually shift weight to collaborative signals as history accumulates. Netflix, Spotify, YouTube, and Amazon all run hybrid systems in production.

warning

The filter bubble. Content-based filtering can only recommend items similar to what the user has already seen. A user who only watches romantic comedies will only ever see more romantic comedies. They will never discover they also love indie sci-fi. Mitigate by injecting diversity constraints (ensure no more than 60% of recommendations share the same dominant feature cluster) and by combining with collaborative or popularity-based signals that introduce serendipity.


YouTube Resources

| Video | Channel | What You Will Learn |
|---|---|---|
| Content-Based Filtering | Google ML | Clear walkthrough of content-based systems end to end |
| TF-IDF Explained | StatQuest | TF-IDF intuition with great visuals |
| Spotify's Music Recommendation | Spotify Engineering | Real-world audio features and how they drive recommendations |
| Word2Vec and Embeddings | Andrej Karpathy | How to build richer item representations beyond TF-IDF |

Interview Q&A

Q1: What is the filter bubble problem and how do you mitigate it?

A: Content-based filtering recommends items similar to what the user already likes. A user with narrow tastes never gets exposed to anything outside that taste neighborhood - the system creates an echo chamber. Over time this leads to user boredom, reduced session length, and churn.

Mitigations:

  • Diversity constraints: MMR (Maximal Marginal Relevance) penalizes items too similar to already-selected recommendations. The next recommendation is selected to maximize a blend of relevance and diversity: $\text{MMR} = \lambda \cdot \text{sim}(u, d_i) - (1-\lambda) \cdot \max_{d_j \in S} \text{sim}(d_i, d_j)$, where $S$ is the already-selected set.
  • Exploration slots: reserve 10–20% of recommendation positions for items from outside the dominant cluster. Measure click-through on exploration slots separately - if users engage positively, loosen the constraint.
  • Hybrid systems: collaborative filtering naturally introduces serendipity because it discovers items popular with similar users - users the current user has never met, who might have discovered the user's "unknown unknowns".
  • Contextual diversity: vary recommendations by time of day, device, and session context. A user who watches comedy on their phone during commute might want different content on a TV on Friday night.
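The MMR mitigation can be sketched as a greedy loop. This is a minimal version that assumes unit-normalized vectors (so dot products are cosine similarities); the function name `mmr_rerank` and the toy items are illustrative.

```python
import numpy as np

def mmr_rerank(user_vec, item_vecs, k, lam=0.7):
    """Greedy Maximal Marginal Relevance over unit-normalized vectors."""
    relevance = item_vecs @ user_vec       # sim(u, d_i)
    pairwise = item_vecs @ item_vecs.T     # sim(d_i, d_j)
    selected = []
    candidates = list(range(len(item_vecs)))
    while candidates and len(selected) < k:
        def mmr(i):
            # Redundancy = similarity to the closest already-selected item.
            redundancy = max((pairwise[i, j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical items: 0 and 1 are near-duplicates, 2 is different.
raw = np.array([[1.0, 0.0], [10.0, 1.0], [3.0, 4.0]])
items = raw / np.linalg.norm(raw, axis=1, keepdims=True)
user = np.array([1.0, 0.0])

by_relevance = mmr_rerank(user, items, k=3, lam=1.0)   # pure relevance
diversified = mmr_rerank(user, items, k=3, lam=0.3)    # diversity-heavy
```

With `lam=1.0` the near-duplicate item 1 is picked second; with `lam=0.3` the dissimilar item 2 jumps ahead of it. In practice `lam` is tuned against engagement metrics rather than fixed up front.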

Q2: Explain TF-IDF. Why does IDF matter?

A: TF-IDF stands for Term Frequency–Inverse Document Frequency. It scores how important a word is to a specific document relative to a corpus.

TF measures how often a word appears in a document. The word "batman" appearing 10 times in a 100-word plot summary has TF = 0.1. But TF alone is misleading: the word "the" might appear 20 times in every document and have high TF everywhere, telling us nothing about what makes any individual document unique.

IDF fixes this by measuring how rare a term is across the corpus: $\text{IDF}(t) = \log(N / df(t))$. A word in 100% of documents gets IDF = 0. A word in only 1% of documents gets IDF = log(100) ≈ 4.6. IDF identifies the discriminative terms - the ones that actually distinguish documents from each other.

TF-IDF = TF × IDF. Words that are frequent in this document but rare in the corpus get high scores. In a movie description, these are the words that best characterize what the movie is about - genre-specific jargon, character names, locations - weighted by how distinctive they are across the full catalog.

Q3: When would you choose content-based filtering over collaborative filtering?

A: Choose content-based filtering when:

  1. Cold start for new users: a brand new user has no interaction history. CB only needs one or two liked items to form a profile.
  2. Cold start for new items: a new item added to the catalog has no ratings. CB can immediately recommend it if it has features (a new track with audio features, a new article with text).
  3. Long-tail items: items with very few interactions are systematically underserved by CF (not enough signal). CB scores them purely on features.
  4. Privacy requirements: CB requires no user-to-user data sharing. Each user's profile is computed only from their own history and item metadata.
  5. Transparent explanations: "We recommended this because you liked [similar item]" is easy to explain and audit.

Choose collaborative filtering when you have dense interaction data for most users and items, and you want to surface serendipitous recommendations. In practice: start with CB, add CF as data grows, ship a hybrid.

Q4: How do you handle a new item (item cold start) in content-based filtering?

A: This is where content-based filtering genuinely shines over collaborative filtering.

In CF, a new item with zero ratings can never be recommended - it cannot be compared to any other item via user overlap.

In CB, a new item is immediately usable the moment you have its features. The workflow:

  1. Extract features from the new item: TF-IDF on its description, audio features for a track, visual embeddings for an image.
  2. Compute cosine similarity between the new item's vector and all existing user profiles.
  3. Recommend to users whose profiles are most similar to the new item.
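The workflow above inverts the usual query: instead of one user against many items, one new item is scored against many user profiles. A minimal sketch, with a hypothetical helper `users_for_new_item` and toy profiles:

```python
import numpy as np

def users_for_new_item(new_item_vec, user_profiles, top_n):
    """Return (user_index, similarity) for the top_n best-matching users."""
    item = new_item_vec / np.linalg.norm(new_item_vec)
    norms = np.linalg.norm(user_profiles, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # users with empty profiles score 0
    sims = (user_profiles / norms) @ item   # cosine similarity per user
    order = np.argsort(-sims)[:top_n]
    return [(int(u), float(sims[u])) for u in order]

# Hypothetical 3-feature profiles for three users.
user_profiles = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [0.7, 0.7, 0.0]])
new_item = np.array([1.0, 0.1, 0.0])        # features of the just-added item
targets = users_for_new_item(new_item, user_profiles, top_n=2)
```

At production scale the same lookup would typically go through an approximate nearest neighbor index over user profiles rather than a dense matrix product.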

This means CB can drive "launch traffic" for new catalog items - surfacing them to relevant users immediately on launch day. This is valuable for long-tail content discovery and creator satisfaction.

The only limitation: if the metadata for the new item is incomplete or wrong (placeholder description, incorrect genre), recommendations will be poor. Investing in metadata quality and metadata ingestion pipelines is non-negotiable for CB systems.

Q5: How would you combine content-based and collaborative filtering in production?

A: Several approaches:

Weighted hybrid: score each item with both CB and CF, then combine:

$$\text{score}_{hybrid}(u, i) = \alpha \cdot \text{score}_{CB}(u, i) + (1 - \alpha) \cdot \text{score}_{CF}(u, i)$$

$\alpha$ can be static (tuned on a validation set) or dynamic - higher weight on CB for users with sparse history, higher weight on CF for users with rich history.
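A dynamic weight can be as simple as a function of history length. The schedule below (`ramp / (ramp + n)`) and the `ramp=20` default are assumptions chosen for illustration; any monotonically decreasing function of interaction count would fit the same pattern.

```python
def cb_weight(n_interactions: int, ramp: float = 20.0) -> float:
    """Alpha: 1.0 with no history, 0.5 at `ramp` interactions, then toward 0."""
    return ramp / (ramp + n_interactions)

def hybrid_score(score_cb: float, score_cf: float, n_interactions: int) -> float:
    """Blend CB and CF scores, trusting CF more as history accumulates."""
    alpha = cb_weight(n_interactions)
    return alpha * score_cb + (1 - alpha) * score_cf

cold_user = hybrid_score(0.8, 0.2, n_interactions=0)     # all content-based
warm_user = hybrid_score(0.8, 0.2, n_interactions=20)    # even blend
heavy_user = hybrid_score(0.8, 0.2, n_interactions=980)  # almost all CF
```

This assumes the CB and CF scores are already on comparable scales; if they are not, calibrate or rank-normalize them before blending.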

Switching hybrid: use CB below a history threshold (for example, fewer than 20 interactions) and switch to CF above it. Simple and interpretable.

Cascade hybrid: run CF first to get a large candidate set (for example, 1,000 items), then rerank with CB. CB scoring is cheap at the reranking stage because it only touches the candidates, and the reranker can also apply item-level quality signals.
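A cascade can be sketched as two sorts over precomputed scores. The function name `cascade_recommend` and the toy score dictionaries are hypothetical; a real system would compute CB scores only for the candidates rather than for the whole catalog.

```python
def cascade_recommend(cf_scores: dict, cb_scores: dict,
                      n_candidates: int, k: int) -> list:
    """Stage 1: keep the top n_candidates by CF. Stage 2: rerank them by CB."""
    candidates = sorted(cf_scores, key=cf_scores.get, reverse=True)[:n_candidates]
    reranked = sorted(candidates, key=lambda i: cb_scores.get(i, 0.0), reverse=True)
    return reranked[:k]

# Toy scores (assumed). Item "d" has the best CB score but a poor CF score.
cf_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
cb_scores = {"a": 0.2, "b": 0.6, "c": 0.9, "d": 1.0}

result = cascade_recommend(cf_scores, cb_scores, n_candidates=3, k=2)
print(result)
```

Note the trade-off this makes visible: item "d" never reaches the reranker because the first stage filtered it out, so the candidate set size bounds what the second stage can recover.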

Feature augmentation: use CB item embeddings as additional input features in a learned CF model (neural collaborative filtering, two-tower, etc.). The model learns when CB features are more or less predictive than interaction signals.

In practice, teams often run several of these variants as concurrent experiments and let A/B test results drive architecture decisions. The winning approach depends heavily on catalog size, interaction density, and how quickly new items need to reach users.


Evaluating a Content-Based Recommender

Before shipping, you need to measure whether the recommender is actually good. There are two evaluation paradigms:

Offline Evaluation

Split your interaction log into train and test by time - train on interactions before a cutoff date, evaluate on interactions after. This simulates real conditions where you train on historical data and must predict future behavior.

Precision@K: of the top-$K$ recommendations, what fraction did the user actually interact with?

$$\text{Precision@K} = \frac{|\text{recommended} \cap \text{relevant}|}{K}$$

Recall@K: of all items the user actually interacted with, what fraction appeared in the top-$K$ recommendations?

$$\text{Recall@K} = \frac{|\text{recommended} \cap \text{relevant}|}{|\text{relevant}|}$$

NDCG@K (Normalized Discounted Cumulative Gain): rewards relevant items appearing higher in the ranking:

$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}, \quad \text{DCG@K} = \sum_{i=1}^{K} \frac{\text{rel}_i}{\log_2(i+1)}$$

where $\text{rel}_i = 1$ if the item at position $i$ is relevant, 0 otherwise, and IDCG@K is the DCG of the ideal ranking.

import numpy as np

def precision_at_k(recommended: list, relevant: set, k: int) -> float:
    """Fraction of top-k recommendations that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended: list, relevant: set, k: int) -> float:
    """Fraction of relevant items captured in top-k."""
    if not relevant:
        return 0.0
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended: list, relevant: set, k: int) -> float:
    """Normalized DCG at k."""
    top_k = recommended[:k]
    dcg = sum(
        1.0 / np.log2(i + 2)  # i+2 because i is 0-indexed, formula uses 1-indexed
        for i, item in enumerate(top_k)
        if item in relevant
    )
    # Ideal DCG: all relevant items at top positions
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# ── Evaluate across all test users ────────────────────────────────────────────
def evaluate_recommender(
    recommender_fn,           # callable: user_id -> list of ranked item_ids
    test_interactions: dict,  # {user_id: set of relevant item_ids}
    k: int = 10
) -> dict:
    """Compute mean Precision@K, Recall@K, NDCG@K across test users."""
    precisions, recalls, ndcgs = [], [], []

    for user_id, relevant in test_interactions.items():
        if not relevant:
            continue
        recs = recommender_fn(user_id)
        precisions.append(precision_at_k(recs, relevant, k))
        recalls.append(recall_at_k(recs, relevant, k))
        ndcgs.append(ndcg_at_k(recs, relevant, k))

    return {
        f"Precision@{k}": np.mean(precisions),
        f"Recall@{k}": np.mean(recalls),
        f"NDCG@{k}": np.mean(ndcgs),
    }

# Example usage (mock data)
test_users = {
    "user_0": {3, 7, 9},
    "user_1": {1, 5},
    "user_2": {2, 6, 8, 10},
}

def mock_recommender(user_id: str) -> list:
    # In reality: score all unseen items and return sorted list
    return [1, 3, 5, 7, 9, 2, 4, 6, 8, 10]

metrics = evaluate_recommender(mock_recommender, test_users, k=10)
for metric, value in metrics.items():
    print(f" {metric}: {value:.4f}")

Online Evaluation: A/B Testing

Offline metrics are necessary but not sufficient. A recommender that improves NDCG@10 by 5% offline might not translate to any measurable improvement in the product metric you actually care about - session length, click-through rate, subscription renewal.

Always follow offline evaluation with an A/B test:

  1. Route a random 10% of users to the new CB recommender, 90% to control
  2. Measure primary metric (click-through rate, time spent, conversion rate) over 1–2 weeks
  3. Check for novelty effect: engagement sometimes spikes just because the recommendations look different, then returns to baseline
  4. Segment analysis: new users (CB should help most here), long-tenured users (CB may not improve over CF), power users (may notice filter bubble faster)
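To decide whether a difference in click-through rate between the two arms is more than noise, one common choice is a two-proportion z-test. The sketch below uses only the standard library; the function name and the click counts are made up for illustration, and it assumes each user contributes one independent click/no-click outcome, which real traffic often violates.

```python
import math

def two_proportion_z_test(clicks_a: int, n_a: int, clicks_b: int, n_b: int):
    """Two-sided z-test for a difference in click-through rate (B minus A)."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that both arms share one CTR.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical 90/10 split: control = CF baseline, treatment = new CB recommender.
z, p_value = two_proportion_z_test(clicks_a=4500, n_a=90_000,
                                   clicks_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Running the same test separately per segment (new users, long-tenured users, power users) gives the segment analysis from step 4, with the usual caveat that testing many segments inflates the false positive rate.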

Coverage and Diversity Metrics

Recommendation quality is not only about relevance. Two additional metrics matter for long-tail health:

Catalog Coverage: what fraction of the item catalog appears in at least one user's top-K recommendations?

$$\text{Coverage} = \frac{|\bigcup_u \text{TopK}(u)|}{n}$$

High coverage indicates the system is not over-concentrating on a small set of popular items. Content-based filtering usually achieves higher coverage than collaborative filtering because it can recommend long-tail items immediately, with no ratings needed.
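Catalog coverage reduces to a set union over every user's recommendation list. A minimal sketch, with a hypothetical function name and toy data:

```python
def catalog_coverage(top_k_lists: dict, catalog_size: int) -> float:
    """Fraction of the catalog that appears in at least one user's top-K list."""
    recommended = set()
    for items in top_k_lists.values():
        recommended.update(items)
    return len(recommended) / catalog_size

# Three users, top-3 lists, catalog of 10 items (toy data).
recs = {
    "u1": [1, 2, 3],
    "u2": [2, 3, 4],
    "u3": [1, 2, 5],
}
coverage = catalog_coverage(recs, catalog_size=10)
print(f"Coverage: {coverage:.2f}")  # 5 distinct items out of 10
```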

Intra-List Diversity: average pairwise dissimilarity within a recommendation list:

$$\text{ILD} = \frac{1}{K(K-1)} \sum_{i \neq j \in \text{TopK}} \left(1 - \text{sim}(i, j)\right)$$

Low ILD means all recommendations are very similar to each other - the filter bubble in a concrete metric. Target ILD depends on your domain: music discovery benefits from higher ILD than purchase recommendations.
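ILD can be computed directly from the item vectors of one recommendation list. The sketch below assumes cosine similarity as the similarity function and a dense array of item vectors; the function name is hypothetical.

```python
import numpy as np

def intra_list_diversity(item_vectors: np.ndarray) -> float:
    """Average pairwise (1 - cosine similarity) over all ordered pairs i != j."""
    k = item_vectors.shape[0]
    if k < 2:
        return 0.0  # diversity is undefined for a single item; report 0
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True) + 1e-12
    unit = item_vectors / norms
    sim = unit @ unit.T
    # Drop the diagonal (self-similarity) before averaging.
    off_diag_sum = sim.sum() - np.trace(sim)
    return float(1.0 - off_diag_sum / (k * (k - 1)))

near_duplicates = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
orthogonal = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

print(f"ILD, near-duplicate list: {intra_list_diversity(near_duplicates):.3f}")
print(f"ILD, orthogonal list:     {intra_list_diversity(orthogonal):.3f}")
```

A list of identical items scores close to 0, while a list of mutually orthogonal items scores 1, which gives a sense of the scale when setting a target.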


Feature Engineering Deep Dive

N-gram Features for Better Text Representation

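Single words (unigrams) miss important phrases. With unigrams only, "not good" and "good" share the token "good" and look almost identical, despite having opposite meanings. Bigrams capture adjacent word pairs, so "not good" becomes a feature of its own (provided the stop-word list does not strip "not" before the bigrams are formed):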

from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams only (baseline)
tfidf_unigram = TfidfVectorizer(ngram_range=(1, 1), stop_words="english")

# Unigrams + bigrams (recommended)
tfidf_bigram = TfidfVectorizer(ngram_range=(1, 2), stop_words="english",
                               max_features=50000)

# Subword features - handles misspellings and morphological variants
tfidf_char = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                             max_features=50000)

# In practice: combine word and character n-grams for robustness
from scipy.sparse import hstack

def build_rich_tfidf(texts: list) -> tuple:
    """Return combined word + character TF-IDF features."""
    word_vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english",
                               max_features=30000, sublinear_tf=True)
    char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                               max_features=20000, sublinear_tf=True)
    X_word = word_vec.fit_transform(texts)
    X_char = char_vec.fit_transform(texts)
    X_combined = hstack([X_word, X_char])
    return X_combined, word_vec, char_vec

Combining Heterogeneous Features

Real items have multiple feature types - text, categories, numerical attributes, tags. Concatenate them after appropriate normalization:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack, csr_matrix

def build_movie_features(movies_df):
    """
    Combine TF-IDF text + numerical metadata features + one-hot genres.
    Returns a sparse feature matrix of shape (n_movies, total_features).
    """
    # 1. TF-IDF on plot overview
    tfidf = TfidfVectorizer(max_features=5000, stop_words="english", sublinear_tf=True)
    text_features = tfidf.fit_transform(movies_df["overview"].fillna(""))

    # 2. Numerical features (runtime, release_year, vote_average, vote_count)
    num_cols = ["runtime", "release_year", "vote_average", "vote_count"]
    num_data = movies_df[num_cols].fillna(0).values
    scaler = StandardScaler()
    num_features_dense = scaler.fit_transform(num_data)
    num_features = csr_matrix(num_features_dense)

    # 3. One-hot genre encoding
    # genres stored as pipe-separated string: "Action|Adventure|Sci-Fi"
    genre_lists = movies_df["genres"].fillna("").str.split("|")
    all_genres = sorted(set(g for genres in genre_lists for g in genres if g))
    genre_matrix = np.zeros((len(movies_df), len(all_genres)))
    genre_to_idx = {g: idx for idx, g in enumerate(all_genres)}
    for i, genres in enumerate(genre_lists):
        for g in genres:
            if g in genre_to_idx:
                genre_matrix[i, genre_to_idx[g]] = 1.0
    genre_features = csr_matrix(genre_matrix)

    # 4. Concatenate all features
    combined = hstack([text_features, num_features, genre_features])
    print(f"Combined feature matrix shape: {combined.shape}")
    print(f"  TF-IDF features: {text_features.shape[1]}")
    print(f"  Numerical features: {num_features.shape[1]}")
    print(f"  Genre features: {genre_features.shape[1]}")

    return combined

This combined feature matrix captures all available information about each movie. The TF-IDF component dominates dimensionally (5000 features) but the numerical and genre features often carry the most signal per feature - so their relative weight can be amplified by scaling.
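One way to control the relative weight of each feature block is to L2-normalize every block per item and scale it by the square root of its weight before concatenating. With weights that sum to 1, the cosine similarity of the combined vectors then equals the weighted sum of the per-block cosine similarities. The sketch below uses dense NumPy arrays for brevity; the function name `weighted_concat` and the weights 0.3 and 0.7 are assumptions for illustration, not tuned values.

```python
import numpy as np

def weighted_concat(blocks: dict, weights: dict) -> np.ndarray:
    """L2-normalize each feature block row-wise, scale by sqrt(weight), concatenate."""
    parts = []
    for name, X in blocks.items():
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # leave all-zero rows as zeros
        parts.append(np.sqrt(weights[name]) * (X / norms))
    return np.hstack(parts)

# Two toy movies: completely different text, identical genre.
blocks = {
    "text":  np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    "genre": np.array([[1.0, 0.0], [1.0, 0.0]]),
}
weights = {"text": 0.3, "genre": 0.7}  # hypothetical weights that sum to 1

combined = weighted_concat(blocks, weights)
a, b = combined
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Combined cosine similarity: {cosine:.2f}")  # 0.3 * 0 + 0.7 * 1
```

Without this step, the block with the most dimensions or the largest raw magnitudes tends to dominate the similarity regardless of how informative it is.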


Key Takeaways

Content-based filtering is the right first tool when you have rich item metadata and need to serve new users or new items immediately. Its core loop - extract features, build a weighted user profile, score unseen items by cosine similarity - is simple, interpretable, and effective.

The quality of your feature representation determines everything. TF-IDF is the standard baseline for text; sentence transformers offer richer semantics at higher compute cost; structured features like Spotify's audio fingerprints work when domain signals are reliable. Faiss makes billion-scale similarity search practical at millisecond latency.

Evaluation requires both offline metrics (NDCG@K, Precision@K, Recall@K) and online A/B tests. Offline metrics predict ranking quality; online tests measure whether that quality translates into product outcomes. Do not ship a recommender without both.

The hardest failure mode is not mathematical - it is the filter bubble. A technically correct CB system that works exactly as specified will still produce a poor user experience if left unchecked. Diversity constraints (ILD, catalog coverage monitoring), hybrid architectures, and exploration slots are not optional refinements. They are required components of any production-grade recommender.

The moment you understand when CB works and when it breaks, you understand recommender systems architecture at a level most candidates do not reach in interviews.


Glossary

| Term | Definition |
| --- | --- |
| TF-IDF | Term Frequency–Inverse Document Frequency. A text feature weighting scheme that gives high scores to words that are frequent in a document but rare across the corpus. |
| User profile | A vector in item feature space representing the user's inferred preferences, computed as a weighted average of liked item vectors. |
| Item vector | A numerical representation of an item in feature space (TF-IDF, audio features, visual embedding, etc.). |
| Cosine similarity | A similarity measure between two vectors equal to the cosine of the angle between them. Ranges from -1 (opposite direction) to 1 (same direction); for non-negative vectors such as TF-IDF, the range is 0 to 1. |
| Filter bubble | The tendency of content-based systems to only recommend items similar to what the user already knows, preventing discovery of new taste dimensions. |
| Sentence transformer | A transformer encoder (often BERT-based) fine-tuned to produce semantically meaningful dense sentence embeddings. Typically captures richer semantics than TF-IDF, at higher compute cost. |
| Faiss | Facebook AI Similarity Search - a library for efficient approximate nearest neighbor search in high-dimensional spaces, used to serve similarity queries at billion-item scale. |
| ILD | Intra-List Diversity - the average pairwise dissimilarity among items in a recommendation list. A low ILD indicates a filter bubble effect. |
| Catalog coverage | The fraction of the full item catalog that appears in at least one user's recommendation list. A measure of recommendation breadth. |
| Hybrid recommender | A system that combines two or more recommendation approaches (e.g., content-based + collaborative filtering) to leverage their complementary strengths. |


© 2026 EngineersOfAI. All rights reserved.