:::tip 🎮 Interactive Playground Visualize this concept: Try the Cascade and Funnel Design demo on the EngineersOfAI Playground - no code required. :::

Cascade and Funnel Architecture

300 Million Products, 50 Milliseconds

Picture the engineering problem facing the Amazon Search team: a user types "wireless headphones under $100." There are roughly 300 million products in Amazon's catalog. The search result page must render in 50 milliseconds. You need to return 24 highly relevant, available, well-priced products from those 300 million.

The mathematically naive approach is to score all 300 million products with your best ranking model and take the top 24. Your best ranking model is a deep neural network that takes 5ms per batch of 100 products on a GPU. At 300 million products per query, this model would require $300M / 100 \times 5ms = 15,000,000$ ms - over four hours - to process a single search. This is not a latency problem. This is a physics problem.

The solution that every major e-commerce, search, and recommendation system uses is the same: cascade. Do not try to run one model on everything. Run a cheap model on everything, a medium model on the surviving candidates, and an expensive model on a small final set. Each stage is faster than the previous one by one to two orders of magnitude. Each stage is also less accurate than the previous one - the cost of speed. The art is in designing the budget allocation: how many candidates survive each stage, and what does the model at each stage optimize for?

Amazon's search cascade in production (rough numbers): 300M products enter retrieval; 5,000 survive candidate generation; 500 survive pre-ranking; 100 survive the full neural ranker; 24 appear on the first page after business rule re-ranking. The total end-to-end latency is under 50ms. The candidate generation stage runs in 5ms. The full neural ranker runs in 20ms on 100 candidates. The cascade made the physics possible.

Understanding cascade architecture is essential for every ML system design interview and for every production ranking system. It is not one technology - it is an architectural principle. The same pattern appears in YouTube recommendations (millions → hundreds), Uber's ETA prediction (many routes → one), and Stripe's fraud detection (every transaction → flagged subset).

Why This Exists

The Quality-Latency Tradeoff at Scale

Every ranking model lives on a quality-latency curve. More features, more layers, more cross-feature interactions → better ranking quality, higher latency per candidate. The tradeoff is a fundamental constraint.

For a fixed latency budget $T$ and a model with per-candidate latency $t$ , the maximum number of candidates you can score is:

$N_{\max} = \lfloor T / t \rfloor$

For a 50ms total budget with 5ms per 100 candidates (most modern dense rankers on GPU): $N_{\max} = 50ms / (5ms/100) = 1,000$ candidates.

You cannot score more than 1,000 candidates with your best model and stay within 50ms. But your catalog has 300 million products. Something has to reduce 300M to 1,000 before the ranker sees it. That is the cascade.

Why Not Just Cache Everything?

A common question: why not pre-rank all products for all queries and cache results? Two problems:

Query space is unbounded. User queries are free-form text. Amazon sees hundreds of millions of distinct queries. Pre-ranking is only feasible for the top ~100K frequent queries. For the long tail - which represents the majority of revenue - pre-ranking is impractical.

Freshness requirements. Product prices change hourly. Inventory changes in real time. A pre-ranked cache from yesterday recommends products that are now out of stock at the old price. Real-time ranking at each query is a freshness requirement, not a luxury.

Historical Context

The cascade architecture for web search was formalized in the 2006 paper "Learning to Rank for Information Retrieval" (Liu) and in Microsoft's 2008 LambdaMART system. The early web search cascades used hand-crafted features at each stage (BM25, PageRank, and query-document matching at retrieval; neural point-wise scorers at ranking).

The transition to neural cascades happened around 2015-2016. Google replaced BM25-based retrieval with dense neural retrieval (producing embeddings for documents and queries). Facebook deployed neural cascade ranking for News Feed in 2016 (described in "Practical Lessons from Predicting Clicks on Ads at Facebook"). YouTube described their complete cascade in the 2016 paper "Deep Neural Networks for YouTube Recommendations," which remains the canonical reference for cascade design.

The modern cascade - using two-tower retrieval, lightweight MLP pre-ranking, and full cross-feature neural ranking - became the standard by 2018-2019 and is deployed by essentially every large recommendation and search system today.

Core Concepts

The Full Cascade Pipeline

Stage 1: Candidate Generation / Retrieval

Goal: Recall. Do not miss relevant items.

Latency budget: 5-10ms

Scale: Full catalog → 1K-10K candidates

Methods: ANN search over pre-computed item embeddings (two-tower), BM25 keyword matching, rule-based inclusion (boosted categories, sponsored items, trending items), multiple retrieval engines fused with RRF.

Key design decision: Err on the side of recall. It is acceptable to return irrelevant items here - the subsequent stages will filter them. But if a relevant item is not in the candidate set, it has no chance of reaching the user.

Recall@K is the success metric. Target recall@5K of at least 95% on held-out interaction data.

Stage 2: Pre-Ranking

Goal: Aggressive recall-preserving filtering with cheap models.

Latency budget: 10-15ms

Scale: 5K candidates → 200-500 candidates

Methods: Lightweight models that use few features - typically a small MLP (2-3 layers, 64-256 hidden units) with categorical features only (no expensive text processing or image features). The model scores each candidate independently (point-wise scoring).

Key design decision: Use features that are already precomputed or trivially computable. Avoid any feature that requires a real-time lookup (database call, cross-candidate aggregation). The features that typically work well here: item popularity score, category match score, historical CTR, price range match, item embedding similarity to query embedding (precomputed).

Stage 3: Full Ranking

Goal: Quality. Get the ordering right.

Latency budget: 20-30ms

Scale: 200-500 candidates → 50-100 candidates

Methods: Full neural ranking model with rich cross-feature interactions. Typical architecture: DCN (Deep & Cross Network) or DLRM (Deep Learning Recommendation Model). Features include: user history embeddings, item content features (text, image), user-item interaction features (has the user clicked this seller before?), context features (time of day, device type).

Key design decision: This is where you invest modeling quality. Budget 70% of the total per-candidate feature compute budget here. Use listwise or pairwise training objectives (LambdaRank, ListNet) rather than pointwise - they optimize the ranking order, not just absolute scores.

Stage 4: Re-Ranking (Post-Processing)

Goal: Apply constraints that the ranking model cannot optimize.

Latency budget: 5ms

Scale: 50-100 candidates → 20-24 final results

Methods: Deterministic rule application - not a model. Typical constraints:

Diversity: no more than 3 items from the same seller on the first page
Business rules: sponsored items must appear in positions 1, 5, 9 per policy
Inventory: remove out-of-stock items
Legal/compliance: remove items restricted for this user's region
Safety: remove items flagged by trust and safety models

Key design decision: Do not put business logic in the ranking model. Business rules change frequently; models are slow to retrain and deploy. Keep rule application in deterministic post-processing code where it can be updated without model changes.

Budget Allocation Principles

The cascade compression ratio at each stage is a design decision with major quality implications:

Rule of 10: A common heuristic is to reduce by 10x at each stage. 10K → 1K → 100 → 10. This gives roughly equal latency at each stage if each stage's model is 10x more expensive per candidate than the previous.

Precision-weighted budget: Spend more budget where precision matters most. The transition from retrieval to pre-ranking is not critical - you have 5K candidates, so a false negative (missing a relevant item) is less costly because there are many other relevant items in the set. The transition from ranking to re-ranking is critical - with 100 candidates left, every false negative costs you a spot in the final 24.

Monotonic compression is not required: Some systems use wider funnels at specific stages. If your pre-ranker is especially good (high correlation with the full ranker's order), you can keep more candidates. If your retrieval is less reliable (new catalog, cold start), keep more candidates from retrieval.

Code Examples

Pre-Ranker: Lightweight Point-Wise Scorer

import torch
import torch.nn as nn
from typing import List, Dict, Tuple
import numpy as np


class PreRanker(nn.Module):
    """
    Lightweight point-wise pre-ranker.
    Designed for low-latency scoring of 5K candidates.
    Uses only precomputed features - no real-time lookups.
    """

    def __init__(
        self,
        n_categorical_features: int,
        categorical_dims: List[int],  # vocabulary sizes
        n_continuous_features: int = 8,
        hidden_dim: int = 64,
        embed_dim: int = 8,
    ):
        super().__init__()
        # Small embeddings for categorical features
        self.cat_embeds = nn.ModuleList([
            nn.Embedding(dim, embed_dim, padding_idx=0)
            for dim in categorical_dims
        ])

        total_input_dim = n_categorical_features * embed_dim + n_continuous_features
        self.mlp = nn.Sequential(
            nn.Linear(total_input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(
        self,
        categorical_features: torch.Tensor,  # (batch, n_cat)
        continuous_features: torch.Tensor,  # (batch, n_cont)
    ) -> torch.Tensor:
        cat_embedded = []
        for i, embed in enumerate(self.cat_embeds):
            cat_embedded.append(embed(categorical_features[:, i]))
        cat_vec = torch.cat(cat_embedded, dim=-1)

        combined = torch.cat([cat_vec, continuous_features], dim=-1)
        return self.mlp(combined).squeeze(-1)

    @torch.no_grad()
    def score_candidates(
        self,
        candidates: List[Dict],
        batch_size: int = 1024,
    ) -> List[Tuple[str, float]]:
        """Score all candidates and return (item_id, score) sorted by score."""
        all_scores = []
        all_ids = [c["item_id"] for c in candidates]

        for i in range(0, len(candidates), batch_size):
            batch = candidates[i : i + batch_size]
            cat_feats = torch.tensor(
                [[c["category_id"], c["seller_tier"], c["price_bucket"]] for c in batch],
                dtype=torch.long,
            )
            cont_feats = torch.tensor(
                [[c["ctr_7d"], c["cvr_7d"], c["rating"], c["embed_similarity"]] for c in batch
                 ] + [[0.0, 0.0, 0.0, 0.0]] * max(0, batch_size - len(batch)),  # padding
                dtype=torch.float32,
            )[:len(batch)]
            scores = self.forward(cat_feats, cont_feats)
            all_scores.extend(scores.numpy().tolist())

        return sorted(
            zip(all_ids, all_scores),
            key=lambda x: x[1],
            reverse=True,
        )


def pre_rank(candidates: List[Dict], model: PreRanker, keep_k: int = 500) -> List[Dict]:
    """Run pre-ranking and return top-K candidates."""
    scored = model.score_candidates(candidates)
    top_k_ids = {item_id for item_id, _ in scored[:keep_k]}
    return [c for c in candidates if c["item_id"] in top_k_ids]

Full Ranker: Deep & Cross Network (DCN)

class CrossNetwork(nn.Module):
    """Cross network from DCN-V2: models feature interactions explicitly."""

    def __init__(self, input_dim: int, n_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(input_dim, input_dim) for _ in range(n_layers)
        ])

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        """
        Cross network: x_l+1 = x0 * W(x_l) + b + x_l
        Explicitly models all pairwise feature interactions.
        """
        x = x0
        for layer in self.layers:
            x = x0 * layer(x) + x
        return x


class DeepCrossNetwork(nn.Module):
    """
    DCN-V2: parallel deep + cross network for ranking.
    Cross network captures explicit feature interactions.
    Deep network captures implicit higher-order interactions.
    """

    def __init__(self, input_dim: int, cross_layers: int = 3, deep_dims: List[int] = [256, 128, 64]):
        super().__init__()
        self.cross_net = CrossNetwork(input_dim, n_layers=cross_layers)

        deep_layers = []
        prev_dim = input_dim
        for dim in deep_dims:
            deep_layers.extend([nn.Linear(prev_dim, dim), nn.ReLU()])
            prev_dim = dim
        self.deep_net = nn.Sequential(*deep_layers)

        # Combine cross + deep outputs
        combined_dim = input_dim + prev_dim
        self.output_layer = nn.Linear(combined_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        cross_out = self.cross_net(features)
        deep_out = self.deep_net(features)
        combined = torch.cat([cross_out, deep_out], dim=-1)
        return self.output_layer(combined).squeeze(-1)

Re-Ranker: Diversity + Business Rules

def apply_diversity_constraint(
    ranked_candidates: List[Dict],
    max_per_seller: int = 3,
    final_size: int = 24,
) -> List[Dict]:
    """
    Greedy diversity constraint: limit items per seller in final results.
    Iterates through ranked list, skipping items from sellers already at their quota.
    """
    seller_counts: Dict[str, int] = {}
    final_results = []

    for candidate in ranked_candidates:
        if len(final_results) >= final_size:
            break
        seller_id = candidate["seller_id"]
        if seller_counts.get(seller_id, 0) < max_per_seller:
            final_results.append(candidate)
            seller_counts[seller_id] = seller_counts.get(seller_id, 0) + 1

    return final_results


def apply_sponsored_positions(
    organic_results: List[Dict],
    sponsored_results: List[Dict],
    sponsored_positions: List[int] = [0, 4, 8],  # 0-indexed
) -> List[Dict]:
    """Insert sponsored results into specified positions in the organic ranking."""
    final = list(organic_results)
    for pos, sponsored in zip(sponsored_positions, sponsored_results):
        if pos <= len(final):
            final.insert(pos, {**sponsored, "is_sponsored": True})
    return final[:24]

Production Engineering Notes

Quality Loss Across the Cascade

Each stage introduces quality loss - items that would rank high in the final ranker but get filtered out early. Measuring cascade quality loss is critical:

Stage-by-stage recall measurement: For each stage, compute what fraction of the final top-K results from the full ranker survive. If your full ranker's top-24 has 90% recall after pre-ranking, and 95% recall after retrieval, then pre-ranking is the bottleneck.

Oracle analysis: Run the full ranker on the retrieval output (bypassing pre-ranking) on a sample of queries. Compare the top-24 with your production cascade output. The gap tells you the headroom available from improving each intermediate stage.

Consistency Between Stages

A subtle problem: training each stage independently can create inconsistencies. If the pre-ranker is trained on data where the training label is "did the user click?", it may filter out items that have low CTR but high purchase rate. The full ranker may optimize for both - but it never sees those items because the pre-ranker removed them.

Solution: Train later stages on the output distribution of earlier stages. The pre-ranker should be trained on the items that survive retrieval, not on all items. The full ranker should be trained on items that survive pre-ranking. This is called stage-conditioned training and it significantly reduces the inconsistency problem.

Cascade at Inference: Parallelism

Multiple retrieval engines (BM25, dense, rule-based) can run in parallel. Their outputs are merged before pre-ranking. The cascade itself is sequential (pre-ranking waits for retrieval output). Parallelism within stages accelerates the cascade:

import asyncio
from typing import List


async def parallel_retrieval(query: str, retrievers: List) -> List[Dict]:
    """Run multiple retrieval engines in parallel and merge results."""
    tasks = [retriever.retrieve(query) for retriever in retrievers]
    results = await asyncio.gather(*tasks)

    # Merge with RRF
    all_ranked_lists = [list(enumerate(r)) for r in results]
    merged = reciprocal_rank_fusion(all_ranked_lists)
    return merged


async def cascade_ranking(query: str, user_context: Dict) -> List[Dict]:
    """Full cascade pipeline with parallel retrieval."""
    # Stage 1: Parallel retrieval (5ms)
    candidates = await parallel_retrieval(query, [dense_retriever, bm25_retriever])

    # Stage 2: Pre-ranking (10ms) - sequential after retrieval
    pre_ranked = pre_rank(candidates, pre_ranker, keep_k=500)

    # Stage 3: Full ranking (25ms)
    ranked = full_ranker.score_and_rank(pre_ranked, user_context, top_k=100)

    # Stage 4: Re-ranking (5ms)
    final = apply_diversity_constraint(ranked, max_per_seller=3, final_size=24)
    return final

Common Mistakes

danger

Mistake: Training all stages on the same raw interaction data.

Each stage sees a biased subset of items - only items that survived all previous stages. If you train the ranker on raw click logs without accounting for the pre-ranker's filtering, the ranker learns from a non-representative sample. Items that the pre-ranker systematically filters out (even relevant ones) never generate training signal for the ranker. Always train each stage on the output distribution of the preceding stage.

danger

Mistake: Using the same features in pre-ranking and full ranking.

Pre-ranking exists to do fast approximate scoring. If it uses the same expensive features as full ranking, you've doubled your compute cost without adding throughput. Pre-rankers should use only precomputed, cheap-to-fetch features. Move all real-time feature computation to the full ranking stage where it is applied to the small candidate set.

warning

Mistake: Over-pruning at the retrieval stage.

If your retrieval stage returns too few candidates (500 instead of 5,000), the later stages have no room to find the best items. The quality ceiling is set by retrieval recall - no downstream model can recover an item that was dropped at retrieval. When in doubt, keep more candidates at retrieval and rely on pre-ranking to filter aggressively.

tip

Tip: Monitor the cascade compression ratio in production.

Track what fraction of candidates survive each stage in real time. If pre-ranking compression increases from 10x to 30x, something is wrong - either the pre-ranker changed behavior or the retrieval is returning different items. Cascade compression ratios are leading indicators of quality problems that emerge before user metrics degrade.

Interview Q&A

Q: Why does Amazon need four stages of ranking instead of one good model?

A: The physics of computation make a single-stage approach impossible at 300M items with a 50ms budget. The best ranking model takes ~5ms to score 100 candidates on a GPU. At 300M candidates, that is 15 million seconds per query - physically impossible. The cascade solves this by using increasingly accurate but increasingly expensive models on decreasing sets of candidates. The retrieval stage (ANN search, 5ms) brings 300M to 5K. Pre-ranking (cheap MLP, 10ms) brings 5K to 500. Full ranking (deep neural net, 25ms) brings 500 to 100. Each stage is tractable given its input size. The total cost is dominated by the most expensive stage applied to the smallest set. This is orders of magnitude cheaper than any single-stage approach that could theoretically handle the full catalog.

Q: How do you detect when cascade quality loss becomes a problem?

A: The key metric is stage-by-stage recall of the final result set. For a sample of queries, run the full ranker directly on the retrieval output (bypassing pre-ranking) and compare the top-24 results to what the production cascade produces. If they differ significantly, pre-ranking is causing quality loss. Measure this continuously - it should be a dashboard metric. In practice, if your pre-ranking recall of the final top-24 drops below 85%, it is time to improve the pre-ranker (more features, better model, higher candidate cutoff).

Q: How do you handle diversity requirements in a cascade without hurting relevance?

A: Apply diversity constraints in the re-ranking stage, after the full ranker has determined the relevance ordering. Do not bake diversity into the ranker's loss function - it is harder to control, slower to update, and mixes two different objectives that are easier to optimize separately. The re-ranking stage applies greedy diversity: iterate through the ranked list, select items that don't violate diversity constraints, stop when you have the target number of results. This guarantees diversity while minimizing relevance cost because you apply diversity constraints only at the end, after you have the best relevance-ordered list.

Q: At what scale does a cascade architecture become necessary?

A: As a rough guideline, once your catalog exceeds 100,000 items and your latency requirement is under 100ms, a two-stage cascade (retrieval + ranking) is typically necessary. At 10M+ items, three stages become necessary. At 100M+ items, four stages are standard. The precise threshold depends on your hardware (faster GPUs push the threshold higher), your model complexity (simpler models allow more single-stage coverage), and your latency SLA (stricter SLAs require earlier cascade adoption). The most common mistake is building a single-stage system, scaling successfully to 1M items, and then being surprised at 10M items when you have to redesign the entire ranking pipeline. Plan the cascade architecture from the start if you expect to scale beyond 100K items.

Q: How do you train a ranker when the training data is collected under the old ranker's policy?

A: This is the exposure bias problem. The ranker's training data consists of items that the previous ranker exposed to users. Items that the old ranker ranked low never received user interactions, so we have no label for them - even if the new ranker would rank them higher and users would interact with them. Several approaches: counterfactual correction using inverse propensity weighting (weight each training example by the inverse probability that the old ranker would have shown that item); exploration (randomly insert unseen items into results to collect unbiased labels); offline evaluation with synthetic data (train on simulated interactions). In practice, a combination of counterfactual correction and small-scale exploration is used. Pure counterfactual correction without exploration underweights items that the old ranker never showed and overweights items the old ranker always showed.

300 Million Products, 50 Milliseconds​

Why This Exists​

The Quality-Latency Tradeoff at Scale​

Why Not Just Cache Everything?​

Historical Context​

Core Concepts​

The Full Cascade Pipeline​

Stage 1: Candidate Generation / Retrieval​

Stage 2: Pre-Ranking​

Stage 3: Full Ranking​

Stage 4: Re-Ranking (Post-Processing)​

Budget Allocation Principles​

Code Examples​

Pre-Ranker: Lightweight Point-Wise Scorer​

Full Ranker: Deep & Cross Network (DCN)​

Re-Ranker: Diversity + Business Rules​

Production Engineering Notes​

Quality Loss Across the Cascade​

Consistency Between Stages​

Cascade at Inference: Parallelism​

Common Mistakes​

Interview Q&A​