Text Features for ML
The Search That Couldn't Find What Users Wanted
The e-commerce search team had an NDCG@10 of 0.61 - reasonable by industry standards, but below the 0.72 target that engineering and product had agreed on. The model was a learning-to-rank system taking product title, category, and price as inputs, with user query as the primary text signal.
The team's first hypothesis: the model needs more data. They collected six additional months of click and purchase signals. NDCG improved to 0.63. Not enough. Second hypothesis: better model architecture. They experimented with LambdaMART, then a neural ranker. NDCG reached 0.65. Still not at target.
A senior engineer pointed out that neither intervention had changed the text representation. The query and product title were being encoded with TF-IDF computed over the product catalog - a sparse bag-of-words representation that dates back to the 1970s. If a user searched for "running shoes" and the product title said "jogging trainers," the TF-IDF vectors would have zero overlap. No shared vocabulary, no score, no ranking.
The team replaced TF-IDF with a bi-encoder: separate sentence transformer encoders for query and product, fine-tuned on click data. Query-product similarity was now computed in embedding space, where "running shoes" and "jogging trainers" map to nearby vectors. NDCG jumped to 0.79 in two weeks - a gain of 0.18 over the 0.61 baseline (roughly 30% in relative terms), clearing the target.
This lesson covers the complete journey from classical text features to production embedding pipelines, and the engineering decisions that separate a research notebook from a reliable feature system.
Why This Exists: The Vocabulary Problem
Classical text representations treat each unique word as an independent dimension. A TF-IDF vector for "running shoes" has a component for "running" and a component for "shoes" - and nothing for "jogging" or "trainers." Two texts that mean the same thing with different words get orthogonal representations. The cosine similarity between them is zero.
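The zero-overlap behavior described above can be checked directly. The following is a minimal sketch using raw term counts rather than full TF-IDF weights (an assumption made to keep it dependency-free); `bow_cosine` is a hypothetical helper name, not part of any library.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


no_overlap = bow_cosine("running shoes", "jogging trainers")
partial = bow_cosine("running shoes", "running trainers")
print(no_overlap, partial)
```

The first pair scores exactly zero even though a human would call the two phrases near-synonyms; the second pair scores about 0.5 purely because one token happens to match.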
This works adequately when the vocabulary is controlled and consistent - legal documents, technical specifications, medical records with standardized terminology. It fails when users express the same intent with varied language, which is every search query ever typed.
Embedding-based representations solve this by mapping text to a dense vector space learned from a large corpus. Semantically similar texts map to geometrically nearby vectors regardless of surface-level word overlap. The vector for "running shoes" is close to the vector for "jogging trainers" because both appear in similar contexts across millions of documents.
The shift from sparse TF-IDF to dense embeddings is often the single largest improvement available in text feature engineering. But embeddings are not free - they require more compute to produce, more storage to cache, and more infrastructure to serve at low latency. Understanding both approaches, their trade-offs, and their production implications is what this lesson is about.
Historical Context
TF-IDF (Term Frequency-Inverse Document Frequency) was developed in the 1970s through work by Karen Spärck Jones (IDF, 1972) and Gerard Salton (vector space model, 1975). It remained the dominant text representation for information retrieval for three decades.
BM25 (Best Match 25) was introduced by Robertson et al. in 1994 as a probabilistically motivated improvement to TF-IDF, with better handling of document length normalization and term saturation. It remains the state of the art for classical (non-neural) retrieval and is the default ranking function in Elasticsearch and Apache Lucene.
Word2Vec (Mikolov et al., 2013) introduced learned dense word embeddings, showing that words with similar meanings have geometrically similar vector representations. This opened the door to semantic similarity computation.
Sentence-BERT (Reimers & Gurevych, 2019) extended the embedding approach to full sentences using siamese BERT networks, making it practical to encode and compare arbitrary text at sentence level. The Sentence Transformers library (built on this work) became the standard tool for semantic similarity and retrieval tasks.
OpenAI's text-embedding-ada-002 (2022) demonstrated that very large general-purpose embedding models could match or exceed task-specific fine-tuned models on many benchmarks, further democratizing embedding-based text features.
Core Concepts
Classical Text Features: TF-IDF and BM25
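TF-IDF scores each term in a document by combining its term frequency in the document with the inverse of how often it appears across all documents:

$$
\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N}{\text{df}(t)}
$$

where $\text{tf}(t, d)$ is the frequency of term $t$ in document $d$, $N$ is the number of documents in the corpus, and $\text{df}(t)$ is the number of documents that contain $t$.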
High TF-IDF: a word that appears frequently in this document but rarely in the corpus - a "distinctive" term.
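To make the "distinctive term" idea concrete, here is a small worked sketch on a made-up three-document corpus. It assumes the textbook definition (length-normalized term frequency times `log(N / df)`); scikit-learn's `TfidfVectorizer` uses a smoothed IDF by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration).
corpus = [
    "red running shoes",
    "blue running shoes",
    "waterproof hiking boots",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))


def tf_idf(term: str, doc: list) -> float:
    tf = doc.count(term) / len(doc)   # relative term frequency
    idf = math.log(N / df[term])      # unsmoothed inverse document frequency
    return tf * idf


common = tf_idf("running", docs[0])    # term appears in 2 of 3 documents
distinctive = tf_idf("red", docs[0])   # term appears in 1 of 3 documents
print(f"running: {common:.3f}  red: {distinctive:.3f}")
```

Both terms occur once in the first document, yet "red" receives the higher weight because it is rarer across the corpus.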
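BM25 improves on TF-IDF with term saturation (diminishing returns for high term frequency) and document length normalization:

$$
\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{\text{tf}(t, d)\,(k_1 + 1)}{\text{tf}(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}
$$

where $\text{tf}(t, d)$ is the frequency of term $t$ in document $d$, $|d|$ is the length of $d$ in tokens, and $\text{avgdl}$ is the average document length in the corpus.

Typical values: $k_1 \in [1.2, 2.0]$, $b = 0.75$.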
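The two effects, saturation and length normalization, are easier to see numerically. The sketch below implements only the per-term component of BM25, with `k1 = 1.2` and `b = 0.75` taken as commonly cited defaults (an assumption); the IDF shown is the non-negative variant used by Lucene, and other libraries may use a slightly different form.

```python
import math


def bm25_term_weight(tf: float, doc_len: int, avg_doc_len: float,
                     k1: float = 1.2, b: float = 0.75) -> float:
    """Saturating term-frequency component of BM25 (IDF factor omitted)."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)


def bm25_idf(n_docs: int, doc_freq: int) -> float:
    """Non-negative IDF variant (Lucene-style); an assumption in this sketch."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)


# Saturation: the weight grows with tf but never exceeds k1 + 1.
weights = [bm25_term_weight(tf, doc_len=10, avg_doc_len=10) for tf in (1, 2, 5, 50)]

# Length normalization: the same tf counts for less in a longer document.
short_doc = bm25_term_weight(2, doc_len=5, avg_doc_len=10)
long_doc = bm25_term_weight(2, doc_len=40, avg_doc_len=10)

print([round(w, 3) for w in weights], round(short_doc, 3), round(long_doc, 3))
```

Going from 5 to 50 occurrences adds far less weight than going from 1 to 5, which is the behavior raw TF-IDF lacks.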
```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from rank_bm25 import BM25Okapi
import re
from typing import List


def clean_text(text: str) -> str:
    """Standard text cleaning pipeline for ML features."""
    if not isinstance(text, str):
        return ""
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r"http\S+|www\S+", " ", text)
    # Remove email addresses
    text = re.sub(r"\S+@\S+", " ", text)
    # Remove special characters but keep hyphens and apostrophes
    text = re.sub(r"[^a-z0-9\s\-']", " ", text)
    # Collapse multiple spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text


class TextFeatureExtractor:
    """
    Combined classical and metadata text feature extraction.
    Produces a feature matrix that can be fed directly to a ranking model.
    """

    def __init__(
        self,
        max_tfidf_features: int = 50000,
        lsa_components: int = 100,   # dimensionality after SVD
    ):
        self.tfidf = TfidfVectorizer(
            max_features=max_tfidf_features,
            ngram_range=(1, 2),      # unigrams + bigrams
            min_df=5,                # ignore very rare terms
            max_df=0.95,             # ignore very common terms
            sublinear_tf=True,       # log(1+tf) instead of raw tf
            strip_accents="unicode",
        )
        self.svd = TruncatedSVD(n_components=lsa_components, random_state=42)
        self.fitted = False

    def fit_transform(self, texts: List[str]) -> np.ndarray:
        """Fit TF-IDF + LSA on corpus and return dense feature matrix."""
        cleaned = [clean_text(t) for t in texts]
        tfidf_matrix = self.tfidf.fit_transform(cleaned)
        lsa_matrix = self.svd.fit_transform(tfidf_matrix)
        self.fitted = True
        return lsa_matrix  # shape: (n_docs, lsa_components)

    def transform(self, texts: List[str]) -> np.ndarray:
        """Transform new texts using fitted TF-IDF + LSA."""
        assert self.fitted, "Must call fit_transform first"
        cleaned = [clean_text(t) for t in texts]
        tfidf_matrix = self.tfidf.transform(cleaned)
        return self.svd.transform(tfidf_matrix)

    @staticmethod
    def metadata_features(texts: List[str]) -> pd.DataFrame:
        """Lightweight structural features that don't require a trained model."""
        df = pd.DataFrame({"text": texts})
        df["char_length"] = df["text"].str.len()
        df["word_count"] = df["text"].str.split().str.len()
        df["avg_word_length"] = df["char_length"] / (df["word_count"] + 1)
        df["has_numbers"] = df["text"].str.contains(r"\d").astype(int)
        df["exclamation_count"] = df["text"].str.count("!")
        df["question_count"] = df["text"].str.count(r"\?")
        df["uppercase_ratio"] = df["text"].apply(
            lambda t: sum(1 for c in t if c.isupper()) / max(len(t), 1)
        )
        return df.drop(columns=["text"])


# BM25 query-document similarity as a feature
class BM25Feature:
    """Use BM25 score as a feature for ranking models."""

    def __init__(self):
        self.bm25 = None
        self.corpus_tokenized = None

    def fit(self, documents: List[str]) -> "BM25Feature":
        self.corpus_tokenized = [clean_text(d).split() for d in documents]
        self.bm25 = BM25Okapi(self.corpus_tokenized)
        return self

    def score(self, query: str) -> np.ndarray:
        """Return BM25 score for all documents given a query."""
        query_tokens = clean_text(query).split()
        return self.bm25.get_scores(query_tokens)
```
Embedding-Based Features: Sentence Transformers
Sentence transformers encode text into dense, fixed-size vectors where semantic similarity corresponds to geometric proximity. This is the architecture that solved the vocabulary gap problem.
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import torch
import hashlib
import pickle
from pathlib import Path


class EmbeddingFeatureExtractor:
    """
    Production-grade embedding extractor with:
    - Batched encoding for throughput
    - Disk-based caching to avoid redundant computation
    """

    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",  # 33M params, fast + accurate
        cache_dir: str = ".embedding_cache",
        batch_size: int = 256,
        max_length: int = 512,
    ):
        self.model = SentenceTransformer(model_name)
        self.model_name = model_name
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.batch_size = batch_size
        self.max_length = max_length
        # Cap the input length in tokens; longer texts are truncated
        self.model.max_seq_length = max_length

        # Use GPU if available
        if torch.cuda.is_available():
            self.model = self.model.cuda()

    def _cache_key(self, text: str) -> str:
        """Deterministic cache key for a text string."""
        content = f"{self.model_name}::{text}"
        return hashlib.sha256(content.encode()).hexdigest()

    def _load_from_cache(self, text: str):
        key = self._cache_key(text)
        cache_path = self.cache_dir / f"{key}.pkl"
        if cache_path.exists():
            with open(cache_path, "rb") as f:
                return pickle.load(f)
        return None

    def _save_to_cache(self, text: str, embedding: np.ndarray):
        key = self._cache_key(text)
        cache_path = self.cache_dir / f"{key}.pkl"
        with open(cache_path, "wb") as f:
            pickle.dump(embedding, f)

    def encode(self, texts: List[str], use_cache: bool = True) -> np.ndarray:
        """
        Encode texts to embeddings with caching.
        Returns: array of shape (len(texts), embedding_dim)
        """
        embeddings = [None] * len(texts)
        to_encode = []  # (original_index, text) pairs that need encoding

        # Check cache first (keys are built from the cleaned text)
        if use_cache:
            for i, text in enumerate(texts):
                cached = self._load_from_cache(clean_text(text))
                if cached is not None:
                    embeddings[i] = cached
                else:
                    to_encode.append((i, text))
        else:
            to_encode = list(enumerate(texts))

        # Batch encode uncached texts
        if to_encode:
            indices, raw_texts = zip(*to_encode)
            cleaned_texts = [clean_text(t) for t in raw_texts]

            # Encode in batches to manage memory
            batch_embeddings = self.model.encode(
                cleaned_texts,
                batch_size=self.batch_size,
                show_progress_bar=len(cleaned_texts) > 1000,
                normalize_embeddings=True,  # L2 normalize for cosine similarity
                convert_to_numpy=True,
            )

            for idx, original_idx in enumerate(indices):
                emb = batch_embeddings[idx]
                embeddings[original_idx] = emb
                if use_cache:
                    self._save_to_cache(cleaned_texts[idx], emb)

        return np.array(embeddings)

    def query_product_similarity(
        self,
        query: str,
        product_titles: List[str],
    ) -> np.ndarray:
        """Compute cosine similarity between one query and many products."""
        query_emb = self.encode([query])             # (1, dim)
        product_embs = self.encode(product_titles)   # (n, dim)
        return cosine_similarity(query_emb, product_embs).flatten()  # (n,)
```
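Because the extractor requests L2-normalized embeddings, cosine similarity between two of its outputs reduces to a plain dot product. The sketch below checks that identity on made-up three-dimensional vectors (not real model output); the helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Toy "embeddings" (invented numbers, for illustration only).
query = [0.2, 0.9, 0.1]
product = [0.3, 0.8, 0.0]

via_dot = dot(l2_normalize(query), l2_normalize(product))
via_cosine = cosine(query, product)
print(via_dot, via_cosine)
```

In practice this means a batch of query-product scores can be computed with a single matrix multiplication once all vectors are normalized, which is also the form most vector indexes expect.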
Text Cleaning Pipeline
The quality of any text feature - classical or embedding-based - depends heavily on the upstream cleaning. Raw text from user-generated content contains noise that degrades feature quality.
```python
import html
import re
import unicodedata

from langdetect import detect, LangDetectException


def detect_language(text: str) -> str:
    """Detect language with fallback to 'unknown'."""
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"


def build_text_cleaning_pipeline():
    """
    Returns a function that applies a full text cleaning pipeline.
    Handles: HTML tags, emoji, repeated characters, language detection.
    """

    def clean(text: str) -> dict:
        if not isinstance(text, str) or len(text.strip()) == 0:
            return {
                "cleaned": "",
                "language": "unknown",
                "original_char_count": 0,
                "cleaned_char_count": 0,
                "quality_score": 0.0,
            }

        original_length = len(text)

        # Decode HTML entities
        text = html.unescape(text)
        # Normalize unicode (NFC normalization)
        text = unicodedata.normalize("NFC", text)
        # Remove HTML tags
        text = re.sub(r"<[^>]+>", " ", text)
        # Remove URLs
        text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)
        # Normalize repeated characters (loooove -> loove, max 2 repeats)
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)

        # Detect language before further cleaning
        language = detect_language(text[:200])  # sample for speed

        # Keep lowercase ASCII letters, digits, and basic punctuation
        cleaned = re.sub(r"[^a-z0-9\s\-'.,!?]", " ", text.lower())
        cleaned = re.sub(r"\s+", " ", cleaned).strip()

        # Share of the original text that survived cleaning
        quality_score = min(1.0, len(cleaned) / max(original_length, 1))

        return {
            "cleaned": cleaned,
            "language": language,
            "original_char_count": original_length,
            "cleaned_char_count": len(cleaned),
            "quality_score": quality_score,
        }

    return clean
```
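To see what the individual steps do, it can help to trace one noisy string through them. This sketch repeats the same regular expressions outside the pipeline, skipping language detection and the URL step so it needs only the standard library; the input string is invented.

```python
import html
import re

raw = "<p>I loooove these shoes!!! &amp; fast shipping</p>"

text = html.unescape(raw)                     # "&amp;" becomes "&"
text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags
text = re.sub(r"(.)\1{2,}", r"\1\1", text)    # cap character runs at two
text = re.sub(r"[^a-z0-9\s\-'.,!?]", " ", text.lower())
cleaned = re.sub(r"\s+", " ", text).strip()

print(cleaned)
```

Note that the repeated-character rule caps runs at two characters, so "loooove" becomes "loove" rather than "love"; mapping it to a dictionary word would need a spell-correction step, which this pipeline does not attempt.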
Dimensionality Reduction: PCA and UMAP
Embedding vectors are typically 384–1536 dimensions. Training a model on raw embeddings is slow and may cause overfitting when the training set is small. Dimensionality reduction compresses embeddings while retaining most of the variance.
PCA (Principal Component Analysis): Linear projection onto the directions of maximum variance. Fast, deterministic, and approximately invertible (a reduced vector can be mapped back to the original space, with some reconstruction loss). Works well when the data lies in a lower-dimensional linear subspace. Typical reduction: 768 → 64–128 dimensions.
UMAP (Uniform Manifold Approximation and Projection): Non-linear reduction that preserves local neighborhood structure better than PCA on complex manifolds, while retaining some global structure. Slower, stochastic. Better for visualization (2D/3D) and for datasets where linear structure is a poor assumption.
```python
from sklearn.decomposition import PCA
import umap


def reduce_embeddings(
    embeddings: np.ndarray,
    method: str = "pca",
    n_components: int = 64,
    seed: int = 42,
) -> np.ndarray:
    """
    Reduce embedding dimensionality for use as ML features.
    embeddings: (n_samples, embedding_dim)
    Returns: (n_samples, n_components)
    """
    if method == "pca":
        reducer = PCA(n_components=n_components, random_state=seed)
        reduced = reducer.fit_transform(embeddings)
        explained_var = reducer.explained_variance_ratio_.sum()
        print(f"PCA: {n_components} components explain {explained_var:.1%} of variance")
        return reduced

    elif method == "umap":
        reducer = umap.UMAP(
            n_components=n_components,
            n_neighbors=15,
            min_dist=0.1,
            metric="cosine",
            random_state=seed,
        )
        return reducer.fit_transform(embeddings)

    else:
        raise ValueError(f"Unknown method: {method}. Use 'pca' or 'umap'.")
```
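One practical caveat: `reduce_embeddings` fits a reducer and returns only the reduced matrix, so the same projection cannot later be applied to new texts such as incoming queries. A minimal sketch of the fit-once, reuse-everywhere pattern follows. It uses a plain NumPy SVD instead of scikit-learn's `PCA` class to stay self-contained, and `fit_pca` and `apply_pca` are hypothetical helper names; with scikit-learn, the equivalent is to keep the fitted `PCA` object and call its `transform` method.

```python
import numpy as np


def fit_pca(train: np.ndarray, n_components: int):
    """Return (mean, components) so the same projection can be reused later."""
    mean = train.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(x: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project data with a previously fitted mean and component matrix."""
    return (x - mean) @ components.T


rng = np.random.default_rng(0)
catalog = rng.normal(size=(200, 32))   # stand-in for product embeddings
queries = rng.normal(size=(5, 32))     # stand-in for query embeddings

# Fit on the catalog only, then apply the identical projection to both sides.
mean, components = fit_pca(catalog, n_components=8)
catalog_reduced = apply_pca(catalog, mean, components)
queries_reduced = apply_pca(queries, mean, components)

print(catalog_reduced.shape, queries_reduced.shape)
```

Fitting separate reducers for queries and products would place them in different coordinate systems and make their similarities meaningless.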
Embedding Caching in Production
Re-encoding the same product titles every time a query arrives adds unnecessary latency. For a product catalog of 500,000 items, pre-computing and caching all product embeddings is essential.
```python
import redis
import struct


class ProductEmbeddingCache:
    """
    Redis-backed embedding cache for product titles.
    Serializes float32 embeddings as binary for efficiency.
    """

    def __init__(self, redis_url: str, embedding_dim: int, ttl_seconds: int = 86400):
        self.redis = redis.from_url(redis_url)
        self.embedding_dim = embedding_dim
        self.ttl = ttl_seconds

    def _serialize(self, embedding: np.ndarray) -> bytes:
        return struct.pack(f"{self.embedding_dim}f", *embedding.astype(np.float32))

    def _deserialize(self, data: bytes) -> np.ndarray:
        values = struct.unpack(f"{self.embedding_dim}f", data)
        return np.array(values, dtype=np.float32)

    def get(self, product_id: str):
        data = self.redis.get(f"emb:{product_id}")
        if data is None:
            return None
        return self._deserialize(data)

    def set(self, product_id: str, embedding: np.ndarray):
        data = self._serialize(embedding)
        self.redis.setex(f"emb:{product_id}", self.ttl, data)

    def precompute_catalog(
        self,
        extractor: EmbeddingFeatureExtractor,
        product_df: pd.DataFrame,
        id_col: str = "product_id",
        text_col: str = "title",
    ) -> int:
        """Precompute and cache all product embeddings."""
        titles = product_df[text_col].tolist()
        ids = product_df[id_col].tolist()

        # Batch encode all products
        embeddings = extractor.encode(titles, use_cache=False)

        # Store in Redis
        stored = 0
        for pid, emb in zip(ids, embeddings):
            self.set(pid, emb)
            stored += 1

        return stored
```
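Before provisioning the cache it is worth estimating its size. The back-of-the-envelope sketch below assumes 384-dimensional float32 vectors (the output size of `BAAI/bge-small-en-v1.5`) and the 500,000-item catalog mentioned above, and it ignores Redis key and per-entry overhead. It also round-trips a tiny vector through the same `struct` format the cache class uses.

```python
import struct

dim = 384                      # assumed embedding size
n_products = 500_000
bytes_per_vector = dim * 4     # float32 takes 4 bytes per value
total_mb = n_products * bytes_per_vector / 1e6

# Round trip with the same binary layout as ProductEmbeddingCache.
vector = [0.25, -0.5, 1.0]     # values chosen to be exact in float32
payload = struct.pack(f"{len(vector)}f", *vector)
restored = list(struct.unpack(f"{len(vector)}f", payload))

print(bytes_per_vector, total_mb, len(payload), restored)
```

Under these assumptions the raw payload is roughly 768 MB, which fits comfortably in a single Redis instance; a 1536-dimensional model would need four times as much.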
Production Engineering Notes
Model versioning for embeddings: When you update the embedding model, all cached embeddings become stale and incompatible. Version your embedding cache keys with the model version. After updating, warm up the new cache before cutting over serving traffic.
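One way to version cache keys is to fold the model name and a version tag into the key itself, so old and new embeddings can coexist during the warm-up period. This is a sketch only: `versioned_key`, the version strings, and the product ID are hypothetical, and the `emb:` prefix simply mirrors the cache class above.

```python
import hashlib


def versioned_key(product_id: str, model_name: str, model_version: str) -> str:
    """Build a cache key that changes whenever the encoder changes."""
    tag = hashlib.sha256(f"{model_name}@{model_version}".encode()).hexdigest()[:12]
    return f"emb:{tag}:{product_id}"


# Hypothetical version labels for the same underlying model family.
old_key = versioned_key("sku-123", "BAAI/bge-small-en-v1.5", "2024-01-base")
new_key = versioned_key("sku-123", "BAAI/bge-small-en-v1.5", "2024-06-finetuned")

print(old_key)
print(new_key)
```

Because the two keys never collide, the new cache can be filled in the background and serving can switch over by changing a single version setting.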
Latency budget: A sentence transformer inference call takes 5–50ms depending on model size and hardware. For real-time serving, pre-compute query embeddings at the edge if possible, or use a smaller, faster model (BAAI/bge-small-en-v1.5 at 33M parameters vs. all-mpnet-base-v2 at 110M).
Embedding drift: Unlike classical TF-IDF features, embedding representations are tied to a specific model. When you fine-tune on new click data, the embedding space shifts. All downstream distance computations and index structures (FAISS, approximate nearest neighbor indices) must be rebuilt. Plan this as a periodic maintenance operation.
Common Mistakes
:::danger Using TF-IDF when vocabulary mismatch is the core problem If users phrase queries differently from how products are described, TF-IDF will have zero recall for those queries - cosine similarity between non-overlapping vocabularies is exactly zero. Before selecting a text representation, measure vocabulary overlap between queries and documents. If it is low (as a rough rule of thumb, below about 60%), embedding-based representations are not optional - they are the minimum viable approach. :::
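Vocabulary overlap can be measured in several ways. One simple option, sketched here, is the share of distinct query terms that also appear somewhere in the document vocabulary. The whitespace tokenization, the function name, and the sample data are all assumptions; a real measurement would reuse the production tokenizer and a sample of logged queries.

```python
def vocabulary_overlap(queries, documents) -> float:
    """Fraction of distinct query terms that also occur in the document vocabulary."""
    query_vocab = {t for q in queries for t in q.lower().split()}
    doc_vocab = {t for d in documents for t in d.lower().split()}
    if not query_vocab:
        return 0.0
    return len(query_vocab & doc_vocab) / len(query_vocab)


# Invented sample data.
queries = ["running shoes", "cheap sneakers", "trail shoes"]
titles = ["jogging trainers", "trail running footwear", "leather boots"]

overlap = vocabulary_overlap(queries, titles)
print(f"{overlap:.0%} of query terms appear in product titles")
```

Here only "running" and "trail" out of five distinct query terms appear in any title, so the overlap is 40%.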
:::danger Not caching embeddings for large catalogs Re-encoding a 500,000-item product catalog at query time at 50ms per item would take 7 hours. Pre-compute and cache all catalog embeddings. Only encode the query at request time - which is fast because it is a single short text. :::
:::warning Skipping text cleaning before embedding While embedding models are robust to some noise, HTML tags, URL strings, and repeated characters degrade embedding quality. A product description that starts with three lines of navigation links will have an embedding dominated by those links rather than the actual product content. Always clean text before encoding. :::
:::tip Start with BM25 as a feature, not just as a baseline BM25 captures exact term match, which embedding models can miss when exact keywords matter (model numbers, product codes, SKUs). In production ranking systems, combining BM25 score with embedding similarity as features to a learning-to-rank model consistently outperforms either alone. This is the "hybrid retrieval" approach used by most production search systems. :::
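A sketch of what "combining as features" can look like: each candidate gets one row holding a normalized BM25 score and an embedding similarity. In a real system these rows would go to the learning-to-rank model; the fixed-weight blend below is only a stand-in to make the example runnable, and all scores and the `alpha` weight are made up.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so BM25 and cosine values are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Invented scores for three candidates and the query "dyson v11 vacuum".
candidates = ["Dyson V11 cordless vacuum", "Cordless stick hoover", "Dyson V8 filter"]
bm25_scores = [7.4, 0.0, 3.1]
embedding_sims = [0.83, 0.71, 0.52]

# One feature row per candidate: (normalized BM25, embedding similarity).
features = list(zip(min_max(bm25_scores), embedding_sims))

alpha = 0.5  # hypothetical blend weight standing in for a trained ranker
blended = [alpha * b + (1 - alpha) * e for b, e in features]
ranking = sorted(range(len(candidates)), key=lambda i: blended[i], reverse=True)

for i in ranking:
    print(f"{blended[i]:.3f}  {candidates[i]}")
```

The exact-match signal lifts the "Dyson V8 filter" above the semantically closer but brand-less "Cordless stick hoover", which is the kind of correction BM25 contributes when model numbers matter.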
Interview Q&A
Q: What is the difference between TF-IDF and sentence embeddings, and when would you choose each?
A: TF-IDF produces a sparse vector over a fixed vocabulary. Each dimension corresponds to a term, and the value is a function of how often the term appears in the document vs. the corpus. It captures exact term match. Sentence embeddings produce a dense vector in a learned semantic space where similar meanings map to nearby vectors regardless of vocabulary. Choose TF-IDF when: vocabulary is controlled and consistent, interpretability matters, you need fast computation without GPU, or your domain uses specialized terminology that a general embedding model may not encode well. Choose sentence embeddings when: users express queries with different words than document vocabulary uses, you need semantic similarity across paraphrases, or you have training data to fine-tune the encoder for your domain.
Q: What is the vocabulary gap problem in text retrieval and how do embeddings solve it?
A: The vocabulary gap is when a query and a relevant document express the same concept with different words. A query for "jogging shoes" finds zero overlap with a product titled "running trainers" in TF-IDF space - cosine similarity is zero, and the product is not retrieved. Embeddings solve this because they are trained on large corpora where "jogging" and "running" appear in similar contexts, causing their vectors to be close in embedding space. A query embedding for "jogging shoes" will have high cosine similarity with a product embedding for "running trainers" even with zero vocabulary overlap.
Q: How do you make embedding-based text features practical for a real-time serving system?
A: Three key strategies. First, pre-compute and cache: for any catalog or corpus that changes slowly (products, articles, documents), encode all items offline and store embeddings in a fast key-value store (Redis). At serving time, encode only the query - one short text is fast even without a GPU. Second, model selection: use a smaller, faster model for latency-sensitive paths. BAAI/bge-small-en-v1.5 (33M parameters) runs on CPU at 5–10ms per query; the full BAAI/bge-large takes 80–200ms. Third, approximate nearest neighbor search: for retrieval tasks, use FAISS or ScaNN to find the top-K similar embeddings in sub-linear time (roughly O(log n) for graph-based indexes such as HNSW) rather than brute-force cosine similarity over all catalog items.
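For reference, the brute-force baseline that an ANN index replaces looks like the sketch below: score every catalog vector against the query and keep the best K. It is exact but costs O(n) per query. The vectors are invented and assumed to be L2-normalized, so the dot product equals cosine similarity.

```python
import heapq


def top_k(query_vec, catalog, k):
    """Exact top-k by dot product; every catalog vector is scored once."""
    scored = (
        (sum(q * x for q, x in zip(query_vec, vec)), pid)
        for pid, vec in catalog.items()
    )
    return heapq.nlargest(k, scored)


# Toy unit-length vectors (invented for illustration).
catalog = {
    "p1": [1.0, 0.0],
    "p2": [0.6, 0.8],
    "p3": [0.0, 1.0],
}
results = top_k([0.8, 0.6], catalog, k=2)
print(results)
```

This is perfectly adequate for a few thousand items; the case for FAISS or ScaNN starts when the catalog is large enough that a full scan no longer fits the latency budget.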
Q: How do you evaluate whether switching from TF-IDF to embeddings is worth the infrastructure cost?
A: Run an offline comparison using your ranking evaluation metrics (NDCG, MRR, precision@K). Build the embedding-based feature set, retrain the ranking model, and compare test set metrics against the TF-IDF baseline. If the improvement is significant (greater than run-to-run noise), estimate the infrastructure cost: embedding computation time, cache storage, GPU cost if online encoding is needed, and the engineering cost to build and maintain the caching layer. Compare this cost against the revenue impact of the metric improvement - if a 0.05 NDCG improvement translates to a measurable conversion rate increase, the cost calculation usually favors embeddings. Also consider the incremental cost against the baseline: if you already have GPU infrastructure for other models, adding an embedding encoder is nearly free.
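For the offline comparison, NDCG@K is straightforward to compute per query and then average. The sketch below uses linear gain (the relevance grade itself); some implementations use `2**rel - 1` instead, so this choice is an assumption and should match whatever the team already reports. The relevance lists are made up.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """relevances: graded relevance of results, in the order the system ranked them."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Same four results for one query, ranked by two different systems.
baseline = ndcg_at_k([0, 1, 0, 2], k=4)    # best item ranked last
candidate = ndcg_at_k([2, 1, 0, 0], k=4)   # best items ranked first

print(f"baseline={baseline:.3f} candidate={candidate:.3f}")
```

Averaging this value over a held-out query set for both feature sets gives the two numbers to compare before any infrastructure cost is estimated.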
Q: What happens to your text features when the embedding model is updated or retrained?
A: All cached embeddings become invalid - they were computed in a different embedding space. Even small fine-tuning updates shift the geometry enough that old embeddings are incompatible with new ones. The correct process: version all cached embeddings with a model version identifier, pre-compute new embeddings for the full catalog using the updated model (running in parallel with the old model), validate that ranking quality improves on a held-out evaluation set, atomically switch serving to use the new embeddings, and then retire the old version. This is operationally similar to a database migration - you need to run old and new systems simultaneously during the transition. Any FAISS or ANN index built on the old embeddings must also be rebuilt.
