Text Features for ML

The Search That Couldn't Find What Users Wanted

The e-commerce search team had an NDCG@10 of 0.61 - reasonable by industry standards, but below the 0.72 target that engineering and product had agreed on. The model was a learning-to-rank system taking product title, category, and price as inputs, with user query as the primary text signal.

The team's first hypothesis: the model needs more data. They collected six additional months of click and purchase signals. NDCG improved to 0.63. Not enough. Second hypothesis: better model architecture. They experimented with LambdaMART, then a neural ranker. NDCG reached 0.65. Still not at target.

A senior engineer pointed out that neither intervention had changed the text representation. The query and product title were being encoded with TF-IDF computed over the product catalog - a sparse bag-of-words representation rooted in 1970s information retrieval. If a user searched for "running shoes" and the product title said "jogging trainers," the TF-IDF vectors would have zero overlap. No shared vocabulary, no score, no ranking.

The team replaced TF-IDF with a bi-encoder: separate sentence transformer encoders for query and product, fine-tuned on click data. Query-product similarity was now computed in embedding space, where "running shoes" and "jogging trainers" map to nearby vectors. NDCG jumped to 0.79 in two weeks - an absolute gain of 0.18 over the 0.61 baseline (about 30% relative), clearing the target.

This lesson covers the complete journey from classical text features to production embedding pipelines, and the engineering decisions that separate a research notebook from a reliable feature system.


Why This Exists: The Vocabulary Problem

Classical text representations treat each unique word as an independent dimension. A TF-IDF vector for "running shoes" has a component for "running" and a component for "shoes" - and nothing for "jogging" or "trainers." Two texts that mean the same thing with different words get orthogonal representations. The cosine similarity between them is zero.
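The zero-overlap effect is easy to reproduce with plain term counts. The sketch below is a minimal illustration using only the standard library; `bow_cosine` is a hypothetical helper name, and it uses raw counts rather than TF-IDF weights (the weighting changes magnitudes, not the fact that disjoint vocabularies give a zero dot product).

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only terms present in both texts contribute to the dot product
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same_meaning = bow_cosine("running shoes", "jogging trainers")
shared_terms = bow_cosine("running shoes", "trail running shoes")
print(same_meaning)   # 0.0: no shared terms, so the vectors are orthogonal
print(shared_terms)   # positive: two of the terms are shared
```

Two texts with the same intent score lower than two texts that merely share words, which is the failure mode the rest of this lesson addresses.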

This works adequately when the vocabulary is controlled and consistent - legal documents, technical specifications, medical records with standardized terminology. It fails when users express the same intent with varied language, which is every search query ever typed.

Embedding-based representations solve this by mapping text to a dense vector space learned from a large corpus. Semantically similar texts map to geometrically nearby vectors regardless of surface-level word overlap. The vector for "running shoes" is close to the vector for "jogging trainers" because both appear in similar contexts across millions of documents.

The shift from sparse TF-IDF to dense embeddings is often the single largest improvement available in text feature engineering. But embeddings are not free - they require more compute to produce, more storage to cache, and more infrastructure to serve at low latency. Understanding both approaches, their trade-offs, and their production implications is what this lesson is about.

Historical Context

TF-IDF (Term Frequency-Inverse Document Frequency) was developed in the 1970s through work by Karen Spärck Jones (IDF, 1972) and Gerard Salton (vector space model, 1975). It remained the dominant text representation for information retrieval for three decades.

BM25 (Best Match 25) was introduced by Robertson et al. in 1994 as a probabilistically motivated improvement to TF-IDF, with better handling of document length normalization and term saturation. It remains the state of the art for classical (non-neural) retrieval and is the default ranking function in Elasticsearch and Apache Lucene.

Word2Vec (Mikolov et al., 2013) introduced learned dense word embeddings, showing that words with similar meanings have geometrically similar vector representations. This opened the door to semantic similarity computation.

Sentence-BERT (Reimers & Gurevych, 2019) extended the embedding approach to full sentences using siamese BERT networks, making it practical to encode and compare arbitrary text at sentence level. The Sentence Transformers library (built on this work) became the standard tool for semantic similarity and retrieval tasks.

OpenAI's text-embedding-ada-002 (2022) demonstrated that very large general-purpose embedding models could match or exceed task-specific fine-tuned models on many benchmarks, further democratizing embedding-based text features.

Core Concepts

Classical Text Features: TF-IDF and BM25

TF-IDF scores each term in a document by combining its term frequency in the document with the inverse of how often it appears across all documents:

$\text{TF-IDF}(t, d, D) = \underbrace{\frac{f_{t,d}}{\sum_{t'} f_{t',d}}}_{\text{term frequency}} \times \underbrace{\log\frac{|D|}{|\{d' \in D : t \in d'\}|}}_{\text{inverse document frequency}}$

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, and $D$ is the corpus.

High TF-IDF: a word that appears frequently in this document but rarely in the corpus - a "distinctive" term.
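As a worked example, the formula above can be computed directly on a toy corpus. This is a sketch of the unsmoothed textbook definition only; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ. The `tf_idf` function name and the three-document corpus are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores

corpus = [
    "red running shoes".split(),
    "blue running shoes".split(),
    "leather office shoes".split(),
]
weights = tf_idf(corpus)
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {w:.4f}")
```

In the first document, "shoes" appears in every document and gets weight zero, "running" appears in two of three and gets a small weight, and "red" is unique to the document and gets the highest weight.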

BM25 improves on TF-IDF with term saturation (diminishing returns for high term frequency) and document length normalization:

$\text{BM25}(t, d) = \text{IDF}(t) \cdot \frac{f_{t,d} \cdot (k_1 + 1)}{f_{t,d} + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$

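Typical values: $k_1 = 1.5$, $b = 0.75$. Here $|d|$ is the length of document $d$ in tokens and $\text{avgdl}$ is the average document length in the corpus.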
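To make term saturation and length normalization concrete, the per-term BM25 score can be evaluated for a few term frequencies. This sketch takes the IDF as an input because the formula above leaves its exact form open; `bm25_term_score` is a hypothetical helper, not part of `rank_bm25`.

```python
def bm25_term_score(
    tf: float,
    doc_len: float,
    avgdl: float,
    idf: float = 1.0,
    k1: float = 1.5,
    b: float = 0.75,
) -> float:
    """Score contribution of one query term in one document."""
    # Length normalization: > 1 for longer-than-average documents
    length_norm = 1 - b + b * (doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Saturation: each extra occurrence adds less; the score approaches idf * (k1 + 1)
saturation = [bm25_term_score(tf, doc_len=100, avgdl=100) for tf in (1, 2, 10, 100)]
print([round(s, 3) for s in saturation])

# Length normalization: the same single match counts for less in a longer document
average_doc = bm25_term_score(1, doc_len=100, avgdl=100)
long_doc = bm25_term_score(1, doc_len=200, avgdl=100)
print(round(average_doc, 3), round(long_doc, 3))
```

With these defaults the score can never exceed `idf * 2.5`, no matter how often the term repeats, which is what protects BM25 from keyword-stuffed titles.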

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from rank_bm25 import BM25Okapi
import re
from typing import List

def clean_text(text: str) -> str:
    """Standard text cleaning pipeline for ML features."""
    if not isinstance(text, str):
        return ""
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r"http\S+|www\S+", " ", text)
    # Remove email addresses
    text = re.sub(r"\S+@\S+", " ", text)
    # Remove special characters but keep hyphens and apostrophes
    text = re.sub(r"[^a-z0-9\s\-']", " ", text)
    # Collapse multiple spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

class TextFeatureExtractor:
    """
    Combined classical and metadata text feature extraction.
    Produces a feature matrix that can be fed directly to a ranking model.
    """
    def __init__(
        self,
        max_tfidf_features: int = 50000,
        lsa_components: int = 100,    # dimensionality after SVD
    ):
        self.tfidf = TfidfVectorizer(
            max_features=max_tfidf_features,
            ngram_range=(1, 2),       # unigrams + bigrams
            min_df=5,                 # ignore very rare terms
            max_df=0.95,              # ignore very common terms
            sublinear_tf=True,        # log(1+tf) instead of raw tf
            strip_accents="unicode",
        )
        self.svd = TruncatedSVD(n_components=lsa_components, random_state=42)
        self.fitted = False

    def fit_transform(self, texts: List[str]) -> np.ndarray:
        """Fit TF-IDF + LSA on corpus and return dense feature matrix."""
        cleaned = [clean_text(t) for t in texts]
        tfidf_matrix = self.tfidf.fit_transform(cleaned)
        lsa_matrix = self.svd.fit_transform(tfidf_matrix)
        self.fitted = True
        return lsa_matrix    # shape: (n_docs, lsa_components)

    def transform(self, texts: List[str]) -> np.ndarray:
        """Transform new texts using fitted TF-IDF + LSA."""
        # Raise explicitly: assert statements are skipped when Python runs with -O
        if not self.fitted:
            raise RuntimeError("Must call fit_transform first")
        cleaned = [clean_text(t) for t in texts]
        tfidf_matrix = self.tfidf.transform(cleaned)
        return self.svd.transform(tfidf_matrix)

    @staticmethod
    def metadata_features(texts: List[str]) -> pd.DataFrame:
        """Lightweight structural features that don't require a trained model."""
        df = pd.DataFrame({"text": texts})
        df["char_length"]    = df["text"].str.len()
        df["word_count"]     = df["text"].str.split().str.len()
        df["avg_word_length"] = df["char_length"] / (df["word_count"] + 1)
        df["has_numbers"]    = df["text"].str.contains(r"\d").astype(int)
        df["exclamation_count"] = df["text"].str.count("!")
        df["question_count"]    = df["text"].str.count(r"\?")
        df["uppercase_ratio"]   = df["text"].apply(
            lambda t: sum(1 for c in t if c.isupper()) / max(len(t), 1)
        )
        return df.drop(columns=["text"])


# BM25 query-document similarity as a feature
class BM25Feature:
    """Use BM25 score as a feature for ranking models."""
    def __init__(self):
        self.bm25 = None
        self.corpus_tokenized = None

    def fit(self, documents: List[str]) -> "BM25Feature":
        self.corpus_tokenized = [clean_text(d).split() for d in documents]
        self.bm25 = BM25Okapi(self.corpus_tokenized)
        return self

    def score(self, query: str) -> np.ndarray:
        """Return BM25 score for all documents given a query."""
        query_tokens = clean_text(query).split()
        return self.bm25.get_scores(query_tokens)

Embedding-Based Features: Sentence Transformers

Sentence transformers encode text into dense, fixed-size vectors where semantic similarity corresponds to geometric proximity. This is the architecture that solved the vocabulary gap problem.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import torch
import hashlib
import json
import pickle
from pathlib import Path

class EmbeddingFeatureExtractor:
    """
    Embedding extractor with:
    - Batched encoding for throughput
    - Disk-based caching to avoid redundant computation
    Note: encoding errors are not caught here; callers should handle them.
    """
    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",   # 33M params, fast + accurate
        cache_dir: str = ".embedding_cache",
        batch_size: int = 256,
        max_length: int = 512
    ):
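        self.model = SentenceTransformer(model_name)
        # Apply the token limit; inputs longer than this are truncated by the model
        self.model.max_seq_length = max_length
        self.model_name = model_name
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.batch_size = batch_size
        self.max_length = max_length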

        # Use GPU if available
        if torch.cuda.is_available():
            self.model = self.model.cuda()

    def _cache_key(self, text: str) -> str:
        """Deterministic cache key for a text string."""
        content = f"{self.model_name}::{text}"
        return hashlib.sha256(content.encode()).hexdigest()

    def _load_from_cache(self, text: str):
        key = self._cache_key(text)
        cache_path = self.cache_dir / f"{key}.pkl"
        if cache_path.exists():
            with open(cache_path, "rb") as f:
                return pickle.load(f)
        return None

    def _save_to_cache(self, text: str, embedding: np.ndarray):
        key = self._cache_key(text)
        cache_path = self.cache_dir / f"{key}.pkl"
        with open(cache_path, "wb") as f:
            pickle.dump(embedding, f)

    def encode(self, texts: List[str], use_cache: bool = True) -> np.ndarray:
        """
        Encode texts to embeddings with caching.
        Returns: array of shape (len(texts), embedding_dim)
        """
        embeddings = [None] * len(texts)
        to_encode = []   # (original_index, text) pairs that need encoding

        # Check cache first
        if use_cache:
            for i, text in enumerate(texts):
                cached = self._load_from_cache(clean_text(text))
                if cached is not None:
                    embeddings[i] = cached
                else:
                    to_encode.append((i, text))
        else:
            to_encode = list(enumerate(texts))

        # Batch encode uncached texts
        if to_encode:
            indices, raw_texts = zip(*to_encode)
            cleaned_texts = [clean_text(t) for t in raw_texts]

            # Encode in batches to manage memory
            batch_embeddings = self.model.encode(
                cleaned_texts,
                batch_size=self.batch_size,
                show_progress_bar=len(cleaned_texts) > 1000,
                normalize_embeddings=True,   # L2 normalize for cosine similarity
                convert_to_numpy=True,
            )

            for idx, original_idx in enumerate(indices):
                emb = batch_embeddings[idx]
                embeddings[original_idx] = emb
                if use_cache:
                    self._save_to_cache(cleaned_texts[idx], emb)

        return np.array(embeddings)

    def query_product_similarity(
        self,
        query: str,
        product_titles: List[str]
    ) -> np.ndarray:
        """Compute cosine similarity between one query and many products."""
        query_emb = self.encode([query])                          # (1, dim)
        product_embs = self.encode(product_titles)                # (n, dim)
        return cosine_similarity(query_emb, product_embs).flatten()  # (n,)
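Because `encode` is called with `normalize_embeddings=True`, every vector has unit length, so cosine similarity reduces to a plain dot product. The sketch below checks that equivalence with NumPy; the random vectors stand in for real embeddings, and `l2_normalize` is a hypothetical helper.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (eps guards against all-zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

rng = np.random.default_rng(0)
raw_query = rng.normal(size=(1, 8))       # stand-in for one query embedding
raw_products = rng.normal(size=(5, 8))    # stand-in for five product embeddings

# Full cosine formula on the unnormalized vectors
cosine_scores = (raw_products @ raw_query.T).ravel() / (
    np.linalg.norm(raw_products, axis=1) * np.linalg.norm(raw_query)
)

# Dot product on the normalized vectors
dot_scores = (l2_normalize(raw_products) @ l2_normalize(raw_query).T).ravel()

print(np.allclose(cosine_scores, dot_scores))
```

In a serving path this means a single matrix-vector product over the cached catalog matrix is enough to score every product against the query.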

Text Cleaning Pipeline

The quality of any text feature - classical or embedding-based - depends heavily on the upstream cleaning. Raw text from user-generated content contains noise that degrades feature quality.

import langdetect
from langdetect import detect, LangDetectException

def detect_language(text: str) -> str:
    """Detect language with fallback to 'unknown'."""
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"

def build_text_cleaning_pipeline():
    """
    Returns a function that applies a full text cleaning pipeline.
    Handles: HTML tags, emoji, repeated characters, language detection.
    """
    import html
    import unicodedata

    def clean(text: str) -> dict:
        if not isinstance(text, str) or len(text.strip()) == 0:
            return {
                "cleaned": "",
                "language": "unknown",
                "original_char_count": 0,
                "cleaned_char_count": 0,
                "quality_score": 0.0
            }

        original_length = len(text)

        # Decode HTML entities
        text = html.unescape(text)

        # Normalize unicode (NFC normalization)
        text = unicodedata.normalize("NFC", text)

        # Remove HTML tags
        text = re.sub(r"<[^>]+>", " ", text)

        # Remove URLs
        text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)

        # Collapse runs of 3+ identical letters to 2 (loooove -> loove).
        # Restricted to letters so digit runs such as "1000" are preserved.
        text = re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)

        # Detect language before further cleaning
        language = detect_language(text[:200])   # sample for speed

        cleaned = re.sub(r"[^a-z0-9\s\-'.,!?]", " ", text.lower())
        cleaned = re.sub(r"\s+", " ", cleaned).strip()

        quality_score = min(1.0, len(cleaned) / max(original_length, 1))

        return {
            "cleaned": cleaned,
            "language": language,
            "original_char_count": original_length,
            "cleaned_char_count": len(cleaned),
            "quality_score": quality_score
        }

    return clean

Dimensionality Reduction: PCA and UMAP

Embedding vectors are typically 384–1536 dimensions. Training a model on raw embeddings is slow and may cause overfitting when the training set is small. Dimensionality reduction compresses embeddings while retaining most of the variance.

PCA (Principal Component Analysis): Linear projection onto the directions of maximum variance. Fast, deterministic, and approximately invertible (the inverse transform reconstructs the input up to the variance that was discarded). Works well when the data lies close to a lower-dimensional linear subspace. Typical reduction: 768 → 64–128 dimensions.

UMAP (Uniform Manifold Approximation and Projection): Non-linear reduction that preserves local neighborhood structure and can capture curved manifolds that a linear projection flattens; global distances are preserved less faithfully than with PCA. Slower, stochastic. Better suited to visualization (2D/3D) and to datasets where linear structure is a poor assumption.

from sklearn.decomposition import PCA
import umap

def reduce_embeddings(
    embeddings: np.ndarray,
    method: str = "pca",
    n_components: int = 64,
    seed: int = 42
) -> np.ndarray:
    """
    Reduce embedding dimensionality for use as ML features.
    embeddings: (n_samples, embedding_dim)
    Returns: (n_samples, n_components)
    """
    if method == "pca":
        reducer = PCA(n_components=n_components, random_state=seed)
        reduced = reducer.fit_transform(embeddings)
        explained_var = reducer.explained_variance_ratio_.sum()
        print(f"PCA: {n_components} components explain {explained_var:.1%} of variance")
        return reduced

    elif method == "umap":
        reducer = umap.UMAP(
            n_components=n_components,
            n_neighbors=15,
            min_dist=0.1,
            metric="cosine",
            random_state=seed
        )
        return reducer.fit_transform(embeddings)

    else:
        raise ValueError(f"Unknown method: {method}. Use 'pca' or 'umap'.")
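One caveat with a fit-and-transform helper like the one above: the reducer must be fitted once on training embeddings and then reused unchanged at serving time, otherwise training and serving features live in different coordinate systems. The following is a minimal NumPy-only sketch of that pattern; `FittedPCA` is a hypothetical class written for illustration (with scikit-learn you would keep the fitted `PCA` object and call its `transform`).

```python
import numpy as np

class FittedPCA:
    """PCA via SVD that keeps its projection so new data can be transformed."""

    def fit(self, X: np.ndarray, n_components: int) -> "FittedPCA":
        self.mean_ = X.mean(axis=0)
        # Rows of vt are the principal directions of the centered data
        _, singular_values, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[:n_components]
        variance = singular_values ** 2
        self.explained_ratio_ = float(variance[:n_components].sum() / variance.sum())
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        # Reuse the training mean and directions; never refit on serving data
        return (X - self.mean_) @ self.components_.T

rng = np.random.default_rng(42)
train_embeddings = rng.normal(size=(200, 32))   # stand-in for training embeddings
serve_embeddings = rng.normal(size=(10, 32))    # stand-in for serving-time embeddings

pca = FittedPCA().fit(train_embeddings, n_components=8)
serve_reduced = pca.transform(serve_embeddings)
print(serve_reduced.shape, f"{pca.explained_ratio_:.1%}")
```

The fitted mean and components should be versioned and stored alongside the model that consumes the reduced features.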

Embedding Caching in Production

Re-encoding the same product titles every time a query arrives adds unnecessary latency. For a product catalog of 500,000 items, pre-computing and caching all product embeddings is essential.

import redis
import struct

class ProductEmbeddingCache:
    """
    Redis-backed embedding cache for product titles.
    Serializes float32 embeddings as binary for efficiency.
    """
    def __init__(self, redis_url: str, embedding_dim: int, ttl_seconds: int = 86400):
        self.redis = redis.from_url(redis_url)
        self.embedding_dim = embedding_dim
        self.ttl = ttl_seconds

    def _serialize(self, embedding: np.ndarray) -> bytes:
        return struct.pack(f"{self.embedding_dim}f", *embedding.astype(np.float32))

    def _deserialize(self, data: bytes) -> np.ndarray:
        values = struct.unpack(f"{self.embedding_dim}f", data)
        return np.array(values, dtype=np.float32)

    def get(self, product_id: str):
        data = self.redis.get(f"emb:{product_id}")
        if data is None:
            return None
        return self._deserialize(data)

    def set(self, product_id: str, embedding: np.ndarray):
        data = self._serialize(embedding)
        self.redis.setex(f"emb:{product_id}", self.ttl, data)

    def precompute_catalog(
        self,
        extractor: EmbeddingFeatureExtractor,
        product_df: pd.DataFrame,
        id_col: str = "product_id",
        text_col: str = "title"
    ) -> int:
        """Precompute and cache all product embeddings."""
        titles = product_df[text_col].tolist()
        ids = product_df[id_col].tolist()

        # Batch encode all products
        embeddings = extractor.encode(titles, use_cache=False)

        # Store in Redis
        stored = 0
        for pid, emb in zip(ids, embeddings):
            self.set(pid, emb)
            stored += 1

        return stored
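Before provisioning the cache it helps to estimate its size, and to confirm that the binary encoding round-trips. The sketch below uses NumPy's `tobytes` and `frombuffer` as an alternative to `struct` for the same float32 layout; the helper names and the 384-dimension figure (the low end of the range cited earlier) are assumptions for illustration.

```python
import numpy as np

def to_bytes(embedding: np.ndarray) -> bytes:
    """Serialize an embedding as raw float32 bytes (4 bytes per dimension)."""
    return np.asarray(embedding, dtype=np.float32).tobytes()

def from_bytes(data: bytes, dim: int) -> np.ndarray:
    """Deserialize and check the dimension to catch model-version mismatches."""
    arr = np.frombuffer(data, dtype=np.float32)
    if arr.shape[0] != dim:
        raise ValueError(f"Expected {dim} dimensions, got {arr.shape[0]}")
    return arr.copy()   # copy so the result is writable

def catalog_bytes(n_items: int, dim: int) -> int:
    """Raw payload size for a float32 catalog, excluding key and store overhead."""
    return n_items * dim * 4

vec = np.linspace(0.0, 1.0, 384, dtype=np.float32)
restored = from_bytes(to_bytes(vec), dim=384)
payload = catalog_bytes(500_000, 384)
print(np.array_equal(vec, restored), f"{payload / 1e6:.0f} MB")
```

For 500,000 products at 384 dimensions the raw payload is 768 MB, which fits in a single Redis instance but is worth knowing before moving to a 1536-dimension model (four times larger).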

Production Engineering Notes

Model versioning for embeddings: When you update the embedding model, all cached embeddings become stale and incompatible. Version your embedding cache keys with the model version. After updating, warm up the new cache before cutting over serving traffic.
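One way to version cache keys is to fold a model identifier into the key itself, so embeddings from two model versions can coexist during the warm-up. This is a sketch under assumed conventions: the `versioned_key` helper, the key layout, and the version strings are all hypothetical.

```python
import hashlib

def versioned_key(
    model_name: str,
    model_version: str,
    product_id: str,
    prefix: str = "emb",
) -> str:
    """Build a cache key that changes whenever the model or its version changes."""
    # Short, stable tag derived from the model identity
    tag = hashlib.sha256(f"{model_name}@{model_version}".encode()).hexdigest()[:12]
    return f"{prefix}:{tag}:{product_id}"

old_key = versioned_key("BAAI/bge-small-en-v1.5", "base", "sku-123")
new_key = versioned_key("BAAI/bge-small-en-v1.5", "finetune-01", "sku-123")
print(old_key)
print(new_key)
```

Because the old and new keys never collide, the new cache can be filled in the background while serving continues to read the old one.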

Latency budget: A sentence transformer inference call takes 5–50ms depending on model size and hardware. For real-time serving, pre-compute query embeddings at the edge if possible, or use a smaller, faster model (BAAI/bge-small-en-v1.5 at 33M parameters vs. all-mpnet-base-v2 at about 110M).

Embedding drift: Unlike classical TF-IDF features, embedding representations are tied to a specific model. When you fine-tune on new click data, the embedding space shifts. All downstream distance computations and index structures (FAISS, approximate nearest neighbor indices) must be rebuilt. Plan this as a periodic maintenance operation.

Common Mistakes

:::danger Using TF-IDF when vocabulary mismatch is the core problem

If users phrase queries differently from how products are described, TF-IDF will have zero recall for those queries - cosine similarity between non-overlapping vocabularies is exactly zero. Before selecting a text representation, measure vocabulary overlap between queries and documents. If it is below 60%, embedding-based representations are not optional - they are the minimum viable approach.

:::
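Measuring vocabulary overlap can be done in a few lines. The lesson does not pin down an exact definition, so this sketch assumes one: the fraction of query term occurrences that also appear somewhere in the document vocabulary. The function names and the whitespace tokenizer are illustrative choices.

```python
from typing import Iterable, Set

def vocabulary(texts: Iterable[str]) -> Set[str]:
    """Set of lowercase whitespace-separated tokens across all texts."""
    return {token for text in texts for token in text.lower().split()}

def query_term_coverage(queries: Iterable[str], documents: Iterable[str]) -> float:
    """Fraction of query term occurrences found in the document vocabulary."""
    doc_vocab = vocabulary(documents)
    query_terms = [t for q in queries for t in q.lower().split()]
    if not query_terms:
        return 0.0
    covered = sum(1 for t in query_terms if t in doc_vocab)
    return covered / len(query_terms)

sample_queries = ["running shoes", "jogging trainers", "red sneakers"]
sample_titles = ["red running shoes", "blue trail shoes"]
coverage = query_term_coverage(sample_queries, sample_titles)
print(f"{coverage:.0%} of query terms appear in product titles")
```

Run this on a sample of real query logs against the catalog, using the same tokenizer as the retrieval system, so the number reflects what the ranker actually sees.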

:::danger Not caching embeddings for large catalogs

Re-encoding a 500,000-item product catalog at query time at 50ms per item would take roughly 7 hours (500,000 × 0.05 s = 25,000 s). Pre-compute and cache all catalog embeddings. Only encode the query at request time - which is fast because it is a single short text.

:::

:::warning Skipping text cleaning before embedding

While embedding models are robust to some noise, HTML tags, URL strings, and repeated characters degrade embedding quality. A product description that starts with three lines of navigation links will have an embedding dominated by those links rather than the actual product content. Always clean text before encoding.

:::

:::tip Start with BM25 as a feature, not just as a baseline

BM25 captures exact term match, which embedding models can miss when exact keywords matter (model numbers, product codes, SKUs). In production ranking systems, combining BM25 score with embedding similarity as features to a learning-to-rank model consistently outperforms either alone. This is the "hybrid retrieval" approach used by most production search systems.

:::
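A sketch of the hybrid idea: place the BM25 score and the embedding similarity side by side as two columns of the ranker's feature matrix. The per-query min-max scaling of BM25 is an assumption made here because raw BM25 scores are unbounded and vary by query; the function names and sample scores are hypothetical.

```python
import numpy as np

def min_max(x: np.ndarray) -> np.ndarray:
    """Scale values to [0, 1] within one query's candidate list."""
    lo, hi = x.min(), x.max()
    if hi - lo < 1e-12:
        return np.zeros_like(x, dtype=float)   # all candidates tied
    return (x - lo) / (hi - lo)

def hybrid_features(bm25_scores, embedding_sims) -> np.ndarray:
    """Return an (n_candidates, 2) matrix: [scaled BM25, embedding cosine]."""
    bm25 = np.asarray(bm25_scores, dtype=float)
    sims = np.asarray(embedding_sims, dtype=float)
    if bm25.shape != sims.shape:
        raise ValueError("Need one BM25 score and one similarity per candidate")
    return np.column_stack([min_max(bm25), sims])

# Illustrative scores for three candidate products for a single query
features = hybrid_features(
    bm25_scores=[12.4, 0.0, 3.1],        # exact-match signal
    embedding_sims=[0.31, 0.82, 0.55],   # semantic signal
)
print(features)
```

The second candidate has no keyword overlap but the highest semantic similarity; keeping both columns lets the learning-to-rank model decide how to weigh them rather than hard-coding a blend.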

Interview Q&A

Q: What is the difference between TF-IDF and sentence embeddings, and when would you choose each?

A: TF-IDF produces a sparse vector over a fixed vocabulary. Each dimension corresponds to a term, and the value is a function of how often the term appears in the document vs. the corpus. It captures exact term match. Sentence embeddings produce a dense vector in a learned semantic space where similar meanings map to nearby vectors regardless of vocabulary. Choose TF-IDF when: vocabulary is controlled and consistent, interpretability matters, you need fast computation without GPU, or your domain uses specialized terminology that a general embedding model may not encode well. Choose sentence embeddings when: users express queries with different words than document vocabulary uses, you need semantic similarity across paraphrases, or you have training data to fine-tune the encoder for your domain.

Q: What is the vocabulary gap problem in text retrieval and how do embeddings solve it?

A: The vocabulary gap is when a query and a relevant document express the same concept with different words. A query for "jogging shoes" finds zero overlap with a product titled "running trainers" in TF-IDF space - cosine similarity is zero, and the product is not retrieved. Embeddings solve this because they are trained on large corpora where "jogging" and "running" appear in similar contexts, causing their vectors to be close in embedding space. A query embedding for "jogging shoes" will have high cosine similarity with a product embedding for "running trainers" even with zero vocabulary overlap.

Q: How do you make embedding-based text features practical for a real-time serving system?

A: Three key strategies. First, pre-compute and cache: for any catalog or corpus that changes slowly (products, articles, documents), encode all items offline and store embeddings in a fast key-value store (Redis). At serving time, encode only the query - one short text is fast even without a GPU. Second, model selection: use a smaller, faster model for latency-sensitive paths. BAAI/bge-small-en-v1.5 (33M parameters) runs on CPU at 5–10ms per query; the full BAAI/bge-large takes 80–200ms. Third, approximate nearest neighbor search: for retrieval tasks, use FAISS or ScaNN to find the top-K similar embeddings in sublinear time (roughly O(log n) for graph-based indexes such as HNSW) rather than brute-force cosine similarity over all catalog items, at the cost of a small, tunable loss in recall.
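Before adopting an approximate index, it is useful to have an exact brute-force top-K as a reference, both as a baseline for small catalogs and as ground truth for measuring an ANN index's recall. The sketch below uses only NumPy and assumes unit-normalized embeddings; it does not use FAISS or ScaNN, and the random vectors are stand-ins for real data.

```python
import numpy as np

def top_k_exact(query: np.ndarray, catalog: np.ndarray, k: int):
    """Exact top-k by dot product; O(n * dim) per query. Requires k >= 1."""
    scores = catalog @ query
    k = min(k, scores.shape[0])
    # argpartition finds the k best in linear time, then only those are sorted
    candidate_ids = np.argpartition(-scores, k - 1)[:k]
    order = np.argsort(-scores[candidate_ids])
    top_ids = candidate_ids[order]
    return top_ids, scores[top_ids]

rng = np.random.default_rng(7)
catalog = rng.normal(size=(1000, 16))
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)   # unit-length rows
query = catalog[123] + 0.05 * rng.normal(size=16)           # noisy copy of item 123
query /= np.linalg.norm(query)

ids, scores = top_k_exact(query, catalog, k=3)
print(ids, np.round(scores, 3))
```

Recall of an approximate index can then be estimated as the overlap between its returned ids and these exact ids, averaged over a sample of queries.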

Q: How do you evaluate whether switching from TF-IDF to embeddings is worth the infrastructure cost?

A: Run an offline comparison using your ranking evaluation metrics (NDCG, MRR, precision@K). Build the embedding-based feature set, retrain the ranking model, and compare test set metrics against the TF-IDF baseline. If the improvement is significant (greater than noise), estimate the infrastructure cost: embedding computation time, cache storage, GPU cost if online encoding is needed, and the engineering cost to build and maintain the caching layer. Compare this cost against the revenue impact of the metric improvement - if a 0.05 NDCG improvement translates to a measurable conversion rate increase, the cost calculation usually favors embeddings. Also consider the incremental cost against the baseline: if you already have GPU infrastructure for other models, adding an embedding encoder is nearly free.
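Since NDCG is the metric used throughout this lesson, here is a minimal reference implementation for the offline comparison. It assumes linear gain (the relevance label itself) with a log2 position discount; some systems use an exponential gain of 2^rel - 1 instead, so check which variant your existing numbers were computed with.

```python
import math
from typing import Sequence

def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked items."""
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels in the order each ranker returned the items
perfect_ranking = ndcg_at_k([3, 2, 1, 0], k=4)
swapped_ranking = ndcg_at_k([0, 1], k=2)
print(perfect_ranking, round(swapped_ranking, 4))
```

Compute the metric per query and average across the test set for both the TF-IDF and the embedding ranker, ideally with a confidence interval so that differences smaller than noise are not over-interpreted.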

Q: What happens to your text features when the embedding model is updated or retrained?

A: All cached embeddings become invalid - they were computed in a different embedding space. Even small fine-tuning updates shift the geometry enough that old embeddings are incompatible with new ones. The correct process: version all cached embeddings with a model version identifier, pre-compute new embeddings for the full catalog using the updated model (running in parallel with the old model), validate that ranking quality improves on a held-out evaluation set, atomically switch serving to use the new embeddings, and then retire the old version. This is operationally similar to a database migration - you need to run old and new systems simultaneously during the transition. Any FAISS or ANN index built on the old embeddings must also be rebuilt.
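The validate-then-switch step can be sketched as a small registry that only promotes a new embedding version when its offline metric beats the active one. Everything here is hypothetical scaffolding: the class, the `min_gain` threshold, and the version names are assumptions, and a real system would store the active pointer somewhere shared (a config service or a Redis key) so that all servers switch together.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class EmbeddingVersionRegistry:
    """Tracks offline NDCG per embedding version and which version is served."""
    active: Optional[str] = None
    ndcg_by_version: Dict[str, float] = field(default_factory=dict)

    def register(self, version: str, offline_ndcg: float) -> None:
        self.ndcg_by_version[version] = offline_ndcg

    def promote(self, candidate: str, min_gain: float = 0.0) -> bool:
        """Switch serving to the candidate only if it beats the active version."""
        if candidate not in self.ndcg_by_version:
            raise KeyError(f"Version {candidate!r} has not been evaluated")
        if self.active is not None:
            gain = self.ndcg_by_version[candidate] - self.ndcg_by_version[self.active]
            if gain <= min_gain:
                return False   # keep serving the old embeddings
        self.active = candidate   # single pointer update acts as the cutover
        return True

registry = EmbeddingVersionRegistry()
registry.register("v1", 0.61)
registry.register("v2", 0.79)
registry.register("v3", 0.78)

first = registry.promote("v1")
upgraded = registry.promote("v2", min_gain=0.01)
rejected = registry.promote("v3", min_gain=0.01)
print(first, upgraded, rejected, registry.active)
```

The old version's embeddings and index should stay available until the new version has served traffic without regressions, so that rollback is another pointer update rather than a recomputation.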
