End-to-end design of a production ad click prediction system - covering Wide and Deep learning, feature engineering at scale, online learning, calibration, and serving under 10ms.

Ad Click Prediction at Scale

The Auction in Milliseconds

Every time a user loads a web page with ad slots, an auction runs in under 100 milliseconds. Google processes approximately 8.5 billion search queries per day. Each query triggers an ad auction. The auction determines: which of the eligible ads should be shown, in what order, and at what price. The central computation that drives this auction is CTR prediction: P(click | user, ad, context). An ad is worth showing if its expected value - the probability of a click multiplied by the revenue from that click - exceeds the threshold.
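
As a rough illustration of the expected-value rule, the sketch below ranks candidate ads by predicted CTR times the per-click bid and drops any ad below a reserve threshold. The `Candidate` type, the example bids, and the `reserve` value are hypothetical and only serve the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    ad_id: str
    p_click: float   # predicted P(click | user, ad, context)
    bid: float       # advertiser's bid per click, in dollars


def rank_by_expected_value(candidates, reserve=0.01):
    """Keep ads whose expected revenue per impression clears the reserve,
    ordered from highest to lowest expected value."""
    scored = [(c.p_click * c.bid, c) for c in candidates]
    eligible = [(ev, c) for ev, c in scored if ev >= reserve]
    eligible.sort(key=lambda pair: pair[0], reverse=True)
    return [c.ad_id for _, c in eligible]


ads = [
    Candidate("high_bid_low_ctr", p_click=0.01, bid=4.00),   # EV = 0.040
    Candidate("low_bid_high_ctr", p_click=0.06, bid=1.00),   # EV = 0.060
    Candidate("below_reserve", p_click=0.002, bid=2.00),     # EV = 0.004
]
ranking = rank_by_expected_value(ads)
```

The ad with the lower bid ranks first here because its click probability more than makes up the difference, which is why CTR accuracy feeds directly into allocation.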

The engineering challenge is severe. At 8.5 billion queries per day, that is roughly 100,000 queries per second. Each query might have 50 eligible ads. The CTR model must score 50 (query, ad) pairs in under 10 milliseconds, with enough accuracy that the ranking separates genuinely more relevant ads from less relevant ones. At this volume, even a 0.1% improvement in CTR prediction accuracy can be worth hundreds of millions of dollars in additional revenue. A model that is not calibrated correctly (predicted CTR does not match actual CTR) will misprice the auction, leading to inefficient allocation.
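
The load figures follow from simple arithmetic on the numbers quoted above. This is a daily average, so peak traffic would be higher; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

queries_per_day = 8.5e9   # figure quoted in the text
ads_per_query = 50        # eligible ads per query, as assumed in the text

queries_per_second = queries_per_day / SECONDS_PER_DAY
pair_scorings_per_second = queries_per_second * ads_per_query
```

This gives a little under 100,000 queries per second and a little under 5 million (query, ad) scorings per second on average.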

At Facebook (now Meta), the Ads CTR system processes over 5 trillion ad impressions per year. The engineering investment in CTR modeling is massive because the economics are direct: better CTR prediction = better auction allocation = more advertiser value = more ad revenue.

Requirements

Functional requirements:

Predict P(click | user, ad, query, context) for ad ranking and pricing
Score all eligible ads for a query within the auction window
Update the model to reflect new ads and user behavior patterns
Calibrate predictions so that predicted CTR matches actual CTR

Non-functional requirements:

Latency: p99 under 10ms per ad scoring pass (not the full auction)
Throughput: 5 million (user, ad) scorings per second at peak (about 100,000 auction requests per second with 50 ads each)
Freshness: model updated at minimum daily; critical features updated in real time
Calibration: mean predicted CTR within 2% of actual CTR at every decile of predicted scores

The Scale Challenge

At 100,000 auction requests per second, with 50 ads per request, the CTR model scores 5 million (user, ad) pairs per second. At a target of 10ms per request, all 50 ads must be scored within that 10ms window.

This translates to:

Feature lookup: under 3ms for all 50 (user, ad, context) feature vectors
Model inference: under 5ms for 50 forward passes
Overhead (serialization, network): under 2ms

A deep neural network with 10 wide hidden layers is unlikely to score 50 examples within 5ms on a commodity CPU. The CTR model must be architecturally constrained to be fast: typically a 2-4 layer MLP with pre-computed sparse embeddings.
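
A back-of-the-envelope count of multiply-accumulate operations shows why depth is expensive. The sketch assumes a 125-wide input and 1024-unit layers for the deep network; both are illustrative choices, and the shallow sizes mirror the `[1024, 512, 256]` default used later in this lesson.

```python
def mlp_multiply_adds(input_dim, hidden_dims, output_dim=1):
    """Multiply-accumulate operations for one forward pass of a dense MLP
    (bias adds and activations ignored)."""
    dims = [input_dim, *hidden_dims, output_dim]
    return sum(a * b for a, b in zip(dims, dims[1:]))


# Hypothetical 125-wide input (embeddings plus dense features).
shallow = mlp_multiply_adds(125, [1024, 512, 256])
deep = mlp_multiply_adds(125, [1024] * 10)

ads_per_request = 50
shallow_per_request = shallow * ads_per_request
deep_per_request = deep * ads_per_request
```

Under these assumptions the 10-layer network needs roughly 12 times the arithmetic of the 3-layer one for every request, before counting embedding lookups.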

Feature Engineering at Scale

CTR models live and die by their features. The features in large-scale CTR systems fall into three groups: user features, ad features, and cross features that capture the interaction between the two.

Cross features are the most important category. A user interested in fitness who sees a gym membership ad is much more likely to click than the same user seeing a car insurance ad. But a user with no fitness history who sees a gym ad has a much lower CTR. The interaction between user interests and ad topic is more predictive than either feature alone.

import numpy as np


class CTRFeatureExtractor:
    """
    Feature extraction for ad click prediction.
    Produces feature vectors for Wide (sparse, cross) and Deep (dense) components.
    """

    def extract_user_features(self, user: dict) -> dict:
        """Features describing the user at request time."""
        return {
            # Categorical (will be embedded)
            "age_group": user.get("age_group", "unknown"),  # 18-24, 25-34, etc.
            "gender": user.get("gender", "unknown"),
            "country": user.get("country", "US"),
            "device_type": user.get("device_type", "desktop"),

            # Behavioral (normalized)
            "ads_clicked_7d": min(user.get("ads_clicked_7d", 0) / 50, 1.0),
            "ads_shown_7d": min(user.get("ads_shown_7d", 0) / 500, 1.0),
            "user_ctr_7d": float(user.get("user_ctr_7d", 0.02)),
            "session_ads_shown": min(user.get("session_ads_shown", 0) / 20, 1.0),
            "session_ads_clicked": min(user.get("session_ads_clicked", 0) / 5, 1.0),

            # Temporal
            "hour_sin": np.sin(2 * np.pi * user.get("hour", 12) / 24),
            "hour_cos": np.cos(2 * np.pi * user.get("hour", 12) / 24),
            "is_weekend": float(user.get("day_of_week", 0) >= 5),
        }

    def extract_ad_features(self, ad: dict) -> dict:
        """Features describing the ad."""
        return {
            # Categorical
            "ad_category": ad.get("category", "other"),
            "ad_format": ad.get("format", "text"),  # text, image, video, carousel
            "advertiser_domain": ad.get("advertiser_domain", "unknown"),

            # Quality signals
            "historical_ctr": float(ad.get("historical_ctr", 0.02)),
            "historical_ctr_7d": float(ad.get("historical_ctr_7d", 0.02)),
            "ad_quality_score": float(ad.get("quality_score", 5.0)) / 10.0,
            "ad_age_days": min(ad.get("age_days", 0) / 365, 1.0),
            "is_new_ad": float(ad.get("impression_count", 0) < 1000),

            # Creative
            "has_image": float(ad.get("has_image", False)),
            "has_video": float(ad.get("has_video", False)),
            "title_length": min(len(ad.get("title", "")) / 100, 1.0),
        }

    def extract_cross_features(self, user: dict, ad: dict, query: str) -> dict:
        """
        Cross features: interactions between user, ad, and query.
        These are the highest-signal features for CTR prediction.
        """
        user_topics = set(user.get("interest_topics", []))
        ad_topics = set(ad.get("topics", []))

        return {
            # Topic overlap: user interests vs ad topics
            "topic_overlap": (
                len(user_topics & ad_topics) / max(len(user_topics | ad_topics), 1)
            ),
            # Has user clicked this advertiser before?
            "user_advertiser_affinity": float(
                ad.get("advertiser_domain", "") in user.get("clicked_advertisers", [])
            ),
            # Query-ad relevance (from text matching / embedding similarity)
            "query_ad_relevance": float(ad.get("relevance_to_query", 0.5)),
            # Device-format match
            "device_format_match": float(
                (user.get("device_type") == "mobile") == (ad.get("format") == "mobile_banner")
            ),
            # User's historical CTR on ads from this category
            "user_category_ctr": float(
                user.get("category_ctr", {}).get(ad.get("category", ""), 0.02)
            ),
        }

    def build_feature_vector(
        self,
        user: dict,
        ad: dict,
        query: str = "",
        feature_encoder: "FeatureEncoder" = None,
    ) -> np.ndarray:
        """Build complete feature vector for Wide and Deep model."""
        user_f = self.extract_user_features(user)
        ad_f = self.extract_ad_features(ad)
        cross_f = self.extract_cross_features(user, ad, query)

        # Dense features (numerical)
        dense = np.array([
            # User features
            user_f["ads_clicked_7d"], user_f["ads_shown_7d"],
            user_f["user_ctr_7d"], user_f["session_ads_shown"],
            user_f["session_ads_clicked"], user_f["hour_sin"],
            user_f["hour_cos"], user_f["is_weekend"],
            # Ad features
            ad_f["historical_ctr"], ad_f["historical_ctr_7d"],
            ad_f["ad_quality_score"], ad_f["ad_age_days"],
            ad_f["is_new_ad"], ad_f["has_image"], ad_f["has_video"],
            ad_f["title_length"],
            # Cross features
            cross_f["topic_overlap"], cross_f["user_advertiser_affinity"],
            cross_f["query_ad_relevance"], cross_f["device_format_match"],
            cross_f["user_category_ctr"],
        ], dtype=np.float32)

        return dense
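
The extractor above emits only dense values. The wide side of a Wide and Deep model usually consumes sparse cross-product indicators instead, and one common way to build them is feature hashing. The sketch below is a minimal version of that idea; the function names, the feature-string format, and the bucket count are assumptions for illustration.

```python
import zlib


def hashed_cross_features(user_topics, ad_category, age_group, num_buckets=1_000_003):
    """Map cross-product indicators such as (age_group AND ad_category) to
    integer bucket ids that a sparse linear model can look up weights for."""
    crosses = [f"age_x_cat={age_group}|{ad_category}"]
    crosses += [f"topic_x_cat={t}|{ad_category}" for t in sorted(user_topics)]
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return [zlib.crc32(c.encode("utf-8")) % num_buckets for c in crosses]


def wide_logit(active_buckets, weights, bias=0.0):
    """Linear model over one-hot crosses: the bias plus the active weights."""
    return bias + sum(weights.get(b, 0.0) for b in active_buckets)


buckets = hashed_cross_features({"fitness", "running"}, "gym_membership", "25-34")
```

Hashing trades a small collision rate for a fixed-size weight table, which matters when the number of distinct (interest, category) pairs is unbounded.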

Wide and Deep Learning Architecture

Google's 2016 Wide and Deep paper introduced the architecture that dominates production CTR prediction. The wide component memorizes specific (user, ad) interaction patterns. The deep component generalizes to unseen combinations.

import torch
import torch.nn as nn
import torch.nn.functional as F


class WideAndDeepCTR(nn.Module):
    """
    Wide and Deep CTR model (Google 2016).
    Wide: linear model on cross-product features (memorization)
    Deep: embedding + MLP (generalization)

    Typical deployment: wide on known user-ad affinities,
    deep on embeddings of all categorical features.
    """

    def __init__(
        self,
        num_users: int,
        num_ads: int,
        num_categories: int,
        num_devices: int,
        num_countries: int,
        num_wide_features: int,  # number of wide (cross-product) features
        dense_feature_dim: int,
        embedding_dim: int = 32,
        hidden_dims: list = None,
    ):
        super().__init__()
        if hidden_dims is None:
            hidden_dims = [1024, 512, 256]

        # Wide component: linear model
        self.wide = nn.Linear(num_wide_features, 1, bias=True)

        # Deep component: embeddings
        self.user_emb = nn.Embedding(num_users, embedding_dim)
        self.ad_emb = nn.Embedding(num_ads, embedding_dim)
        self.category_emb = nn.Embedding(num_categories, 16)
        self.device_emb = nn.Embedding(num_devices, 8)
        self.country_emb = nn.Embedding(num_countries, 16)

        # Deep MLP
        deep_input_dim = (
            embedding_dim * 2  # user + ad
            + 16 + 8 + 16       # category + device + country
            + dense_feature_dim  # numerical features
        )

        layers = []
        in_dim = deep_input_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(in_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.3),
            ])
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, 1))
        self.deep = nn.Sequential(*layers)

    def forward(
        self,
        user_ids: torch.Tensor,
        ad_ids: torch.Tensor,
        category_ids: torch.Tensor,
        device_ids: torch.Tensor,
        country_ids: torch.Tensor,
        dense_features: torch.Tensor,
        wide_features: torch.Tensor,
    ) -> torch.Tensor:
        # Wide: memorization of cross features
        wide_logit = self.wide(wide_features)

        # Deep: generalization through embeddings
        user_emb = self.user_emb(user_ids)
        ad_emb = self.ad_emb(ad_ids)
        cat_emb = self.category_emb(category_ids)
        dev_emb = self.device_emb(device_ids)
        cty_emb = self.country_emb(country_ids)

        deep_input = torch.cat(
            [user_emb, ad_emb, cat_emb, dev_emb, cty_emb, dense_features],
            dim=1,
        )
        deep_logit = self.deep(deep_input)

        # Final prediction: sigmoid(wide + deep)
        combined_logit = wide_logit + deep_logit
        return torch.sigmoid(combined_logit)


class DeepFMCTR(nn.Module):
    """
    DeepFM (Guo et al., 2017): FM component replaces wide linear model.
    FM captures second-order feature interactions automatically
    (no manual feature engineering needed for cross features).
    """

    def __init__(
        self,
        num_features: list,  # vocabulary size of each field
        num_fields: int,     # must equal len(num_features)
        embedding_dim: int = 16,
        hidden_dims: list = None,
    ):
        super().__init__()
        if hidden_dims is None:
            hidden_dims = [400, 400, 400]

        self.embeddings = nn.ModuleList([
            nn.Embedding(n, embedding_dim) for n in num_features
        ])
        self.fm_linear = nn.ModuleList([
            nn.Embedding(n, 1) for n in num_features
        ])

        mlp_input_dim = num_fields * embedding_dim
        layers = []
        in_dim = mlp_input_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(in_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.2),
            ])
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, 1))
        self.dnn = nn.Sequential(*layers)

    def forward(self, feature_ids: torch.Tensor) -> torch.Tensor:
        """feature_ids: (batch, num_fields)"""
        # First-order FM term
        linear_terms = [emb(feature_ids[:, i]) for i, emb in enumerate(self.fm_linear)]
        fm_first = torch.cat(linear_terms, dim=1).sum(dim=1, keepdim=True)

        # Second-order FM interaction
        embeddings = [emb(feature_ids[:, i]) for i, emb in enumerate(self.embeddings)]
        emb_stack = torch.stack(embeddings, dim=1)  # (B, num_fields, emb_dim)

        # FM formula: 0.5 * (sum^2 - sum(e^2))
        sum_emb = emb_stack.sum(dim=1)    # (B, emb_dim)
        sum_sq_emb = (emb_stack ** 2).sum(dim=1)  # (B, emb_dim)
        fm_second = 0.5 * (sum_emb ** 2 - sum_sq_emb).sum(dim=1, keepdim=True)

        # DNN component
        dnn_input = emb_stack.view(emb_stack.size(0), -1)  # (B, fields * emb_dim)
        dnn_output = self.dnn(dnn_input)

        # Combined
        return torch.sigmoid(fm_first + fm_second + dnn_output)
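
The second-order term in `DeepFMCTR.forward` relies on an algebraic identity: the sum of dot products over all pairs of field embeddings equals 0.5 * ((sum of vectors)^2 - sum of squared vectors), summed over embedding dimensions. The plain-Python check below compares the naive pairwise sum with the fast form on a toy input; the helper names are illustrative.

```python
from itertools import combinations


def fm_pairwise_naive(vectors):
    """Sum of dot products over all field pairs: O(fields^2 * dim)."""
    return sum(
        sum(a * b for a, b in zip(v_i, v_j))
        for v_i, v_j in combinations(vectors, 2)
    )


def fm_pairwise_fast(vectors):
    """Same quantity via 0.5 * ((sum v)^2 - sum(v^2)): O(fields * dim)."""
    dim = len(vectors[0])
    total = 0.0
    for k in range(dim):
        column = [v[k] for v in vectors]
        total += sum(column) ** 2 - sum(x * x for x in column)
    return 0.5 * total


field_embeddings = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
naive = fm_pairwise_naive(field_embeddings)
fast = fm_pairwise_fast(field_embeddings)
```

Both forms agree, and the fast one is linear in the number of fields, which is what makes FM interactions affordable inside a latency budget.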

Online Learning for Ad CTR

Ad CTR models must be updated frequently. New ads have no historical CTR data. Trending topics shift user interests daily. A model trained monthly would be severely out of date. The solution: online learning updates the model continuously as new click data arrives.

import torch
import torch.nn as nn
from torch.optim import SGD
from collections import deque
import threading
import time


class OnlineCTRUpdater:
    """
    Online learning: continuously update the CTR model with new click data.
    Runs as a background service, updating model parameters every few seconds.

    Uses a streaming mini-batch approach:
    - Click events stream from Kafka
    - Non-click events are sampled (click rate is 2%, so 98% of data is non-clicks)
    - Mini-batch SGD updates model parameters
    - Updated model parameters are pushed to the serving layer
    """

    def __init__(
        self,
        model: WideAndDeepCTR,
        learning_rate: float = 0.001,
        batch_size: int = 256,
        update_interval_seconds: float = 5.0,
        non_click_sample_rate: float = 0.05,  # sample 5% of non-clicks for balance
    ):
        self.model = model
        self.optimizer = SGD(
            model.parameters(),
            lr=learning_rate,
            momentum=0.9,
        )
        self.batch_size = batch_size
        self.update_interval = update_interval_seconds
        self.non_click_sample_rate = non_click_sample_rate
        self.event_buffer: deque = deque(maxlen=100_000)
        self.loss_fn = nn.BCELoss()
        self._lock = threading.Lock()

    def add_event(self, event: dict) -> None:
        """
        Add click or non-click event to the buffer.
        Non-clicks are downsampled to avoid overwhelming the buffer.
        """
        is_click = event.get("clicked", False)
        if is_click or (torch.rand(1).item() < self.non_click_sample_rate):
            # Take the same lock as update_step so sampling never reads a
            # buffer that is being mutated.
            with self._lock:
                self.event_buffer.append(event)

    def update_step(self) -> float:
        """
        Run one mini-batch update.
        Returns the loss for monitoring.
        """
        with self._lock:
            if len(self.event_buffer) < self.batch_size:
                return 0.0

            # Sample a mini-batch (convert to plain ints for deque indexing)
            indices = torch.randperm(len(self.event_buffer))[:self.batch_size].tolist()
            batch = [self.event_buffer[i] for i in indices]

        # Build tensors from batch
        features = self._build_batch_tensors(batch)
        labels = torch.tensor(
            [float(e["clicked"]) for e in batch], dtype=torch.float32
        )

        self.model.train()
        self.optimizer.zero_grad()
        predictions = self.model(**features).squeeze()
        loss = self.loss_fn(predictions, labels)
        loss.backward()
        self.optimizer.step()
        self.model.eval()

        return float(loss.item())

    def run_update_loop(self) -> None:
        """Background loop: update model every N seconds."""
        while True:
            loss = self.update_step()
            if loss > 0:
                print(f"[OnlineLearning] Loss: {loss:.4f}")
            time.sleep(self.update_interval)

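    def _build_batch_tensors(self, batch: list) -> dict:
        """Convert batch of events to model input tensors."""
        # Placeholder - in production, this maps event fields to tensor IDs.
        # Raising keeps update_step from calling the model with no inputs.
        raise NotImplementedError("map event fields to model input tensors")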
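
One consequence of downsampling non-clicks is that the model learns an inflated click rate, so its outputs need a correction before they are used as probabilities. A widely used correction for a negative sampling rate `w` is q = p / (p + (1 - p) / w). The sketch below applies it to the 2% click rate and 5% sample rate mentioned above; the function name is illustrative.

```python
def recalibrate_downsampled(p: float, negative_sample_rate: float) -> float:
    """Map a probability learned on negative-downsampled data back to the
    original click rate: q = p / (p + (1 - p) / w)."""
    w = negative_sample_rate
    return p / (p + (1.0 - p) / w)


true_ctr = 0.02
w = 0.05  # keep 5% of non-clicks, as in non_click_sample_rate above
# Click rate the model sees after sampling: clicks kept, non-clicks thinned.
sampled_ctr = true_ctr / (true_ctr + (1 - true_ctr) * w)
recovered_ctr = recalibrate_downsampled(sampled_ctr, w)
```

With these numbers the training data shows a click rate near 29%, and the correction maps that back to 2%. Skipping this step would leave every served prediction far too high.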

Calibration: Predicted CTR Must Match Actual CTR

The CTR model produces probabilities, but these probabilities must be accurate (calibrated) for the auction to work correctly. If the model systematically predicts CTR = 0.05 when the actual CTR is 0.02, advertisers bidding on impressions will overpay relative to actual value.

import numpy as np


def compute_calibration_error(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    n_bins: int = 10,
) -> float:
    """
    Expected Calibration Error (ECE): the average difference between
    predicted probability and actual fraction positive, weighted by bin size.

    Read the result relative to the base CTR: with CTRs near 2%, an absolute
    ECE of 0.02 is a 100% relative error, so the 2% target is a relative one.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Quantile (equal-population) bins. Weights and per-bin statistics come
    # from the same bin assignment, so they always line up.
    edges = np.quantile(y_pred, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.searchsorted(edges[1:-1], y_pred, side="right")

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # ties in y_pred can leave a bin empty
        weight = mask.mean()
        ece += weight * abs(y_true[mask].mean() - y_pred[mask].mean())
    return float(ece)


class IsotonicCTRCalibrator:
    """
    Isotonic regression calibration for CTR models.
    Monotonically maps raw model scores to calibrated probabilities.

    Train calibrator on a holdout set after training the main model.
    """

    def __init__(self):
        from sklearn.isotonic import IsotonicRegression
        self.calibrator = IsotonicRegression(out_of_bounds="clip")

    def fit(self, y_true: np.ndarray, y_pred: np.ndarray) -> None:
        """Fit calibrator on holdout set."""
        self.calibrator.fit(y_pred, y_true)

    def calibrate(self, raw_scores: np.ndarray) -> np.ndarray:
        """Apply calibration to raw model scores."""
        return self.calibrator.predict(raw_scores)

    def evaluate(self, y_true: np.ndarray, y_pred: np.ndarray) -> dict:
        calibrated = self.calibrate(y_pred)
        raw_ece = compute_calibration_error(y_true, y_pred)
        calibrated_ece = compute_calibration_error(y_true, calibrated)
        return {
            "raw_ece": raw_ece,
            "calibrated_ece": calibrated_ece,
            "improvement": raw_ece - calibrated_ece,
        }
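
To make the isotonic step less of a black box, here is a small pure-Python sketch of pool-adjacent-violators, the algorithm behind isotonic regression. It is a simplified illustration, not scikit-learn's implementation: it returns a step function, whereas `IsotonicRegression.predict` interpolates linearly between fitted points. The names `pav_fit` and `pav_predict` are hypothetical.

```python
import bisect


def pav_fit(scores, labels):
    """Pool-adjacent-violators: returns [(max_score, calibrated_value), ...]
    with calibrated values non-decreasing in score."""
    blocks = []  # each block: [label_sum, count, max_score]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while an earlier block has a higher mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            label_sum, count, max_score = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = max_score
    return [(b[2], b[0] / b[1]) for b in blocks]


def pav_predict(steps, score):
    """Step-function lookup, clipped to the fitted score range."""
    thresholds = [t for t, _ in steps]
    i = min(bisect.bisect_left(thresholds, score), len(steps) - 1)
    return steps[i][1]


steps = pav_fit([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
```

In the toy input the middle two points violate monotonicity (a click at 0.2, no click at 0.3), so they are pooled into one block with a calibrated value of 0.5.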

The Exploration Problem: New Ads Need Impressions to Learn CTR

A new ad has no historical CTR data. The CTR model defaults to a generic prediction (e.g., average CTR for that category: 2%). But the actual CTR might be 8% (a great ad) or 0.2% (a terrible ad). You cannot know without showing the ad to users.

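This is an explore/exploit trade-off, usually framed as a bandit problem: you need to explore (show the new ad to learn its CTR) while exploiting (show high-CTR ads to maximize revenue). The Thompson Sampling approach below keeps one Beta posterior per ad and does not condition on context: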

import numpy as np
from dataclasses import dataclass


@dataclass
class AdBanditState:
    """Beta distribution parameters for Thompson Sampling."""
    ad_id: str
    alpha: float = 1.0  # successes (clicks) + 1
    beta: float = 1.0   # failures (non-clicks) + 1

    def sample_ctr(self) -> float:
        """Sample a CTR from the Beta distribution."""
        return float(np.random.beta(self.alpha, self.beta))

    def update(self, clicked: bool) -> None:
        """Update after observing a click or non-click."""
        if clicked:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean_ctr(self) -> float:
        """Point estimate of CTR."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def uncertainty(self) -> float:
        """Variance - higher for ads with fewer impressions."""
        total = self.alpha + self.beta
        return (self.alpha * self.beta) / (total ** 2 * (total + 1))


class ThompsonSamplingExplorer:
    """
    Thompson Sampling for ad exploration.
    New ads start with high uncertainty; sampling naturally explores them.
    As impressions accumulate, the distribution tightens around the true CTR.
    """

    def __init__(self):
        self.ad_states: dict = {}

    def get_or_create_state(self, ad_id: str) -> AdBanditState:
        if ad_id not in self.ad_states:
            self.ad_states[ad_id] = AdBanditState(ad_id=ad_id)
        return self.ad_states[ad_id]

    def score_ads_with_exploration(
        self,
        candidate_ads: list,
        model_scores: np.ndarray,
        exploration_weight: float = 0.1,
    ) -> np.ndarray:
        """
        Blend model CTR prediction with Thompson Sampling exploration.
        For new ads (high uncertainty), exploration dominates.
        For established ads (low uncertainty), model score dominates.
        """
        final_scores = np.zeros(len(candidate_ads))

        for i, ad_id in enumerate(candidate_ads):
            state = self.get_or_create_state(ad_id)
            sampled_ctr = state.sample_ctr()
            model_ctr = float(model_scores[i])

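            # Weight exploration by uncertainty: scale the blend weight by the
            # posterior variance relative to the Beta(1, 1) prior variance (1/12),
            # so it shrinks toward 0 as impressions accumulate.
            weight = exploration_weight * min(state.uncertainty * 12.0, 1.0)
            final_scores[i] = (
                (1 - weight) * model_ctr
                + weight * sampled_ctr
            )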

        return final_scores

    def update_from_feedback(self, ad_id: str, clicked: bool) -> None:
        """Update bandit state after observing click or non-click."""
        state = self.get_or_create_state(ad_id)
        state.update(clicked)

Serving Under 10ms

import numpy as np
import onnxruntime as ort
import redis
import time


class AdCTRServingEngine:
    """
    Production serving engine for ad CTR prediction.
    Optimized for p99 under 10ms at 5M RPS.

    Optimizations:
    1. ONNX model for fast CPU inference (no PyTorch overhead)
    2. Pre-fetched user features (Redis, fetched once per request)
    3. Batched inference (score all 50 ads in one model forward pass)
    4. LRU cache for frequent (user, ad) pairs
    """

    def __init__(
        self,
        model_path: str,
        redis_host: str = "localhost",
    ):
        # ONNX Runtime with CPU optimizations
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = (
            ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        )
        sess_options.intra_op_num_threads = 4
        self.session = ort.InferenceSession(
            model_path, sess_options=sess_options
        )

        self.redis = redis.Redis(host=redis_host, decode_responses=True)
        self.feature_extractor = CTRFeatureExtractor()

    def score_ads(
        self,
        user: dict,
        ads: list,
        query: str = "",
    ) -> np.ndarray:
        """
        Score all ads in one batched forward pass.
        Returns array of CTR predictions, one per ad.
        """
        start = time.monotonic()

        # Build feature matrix for all ads (batched)
        features = np.vstack([
            self.feature_extractor.build_feature_vector(user, ad, query)
            for ad in ads
        ])  # shape: (num_ads, num_features)

        # ONNX inference (assumes the exported model takes one dense input)
        input_name = self.session.get_inputs()[0].name
        predictions = self.session.run(
            None,
            {input_name: features},
        )[0].reshape(-1)  # stays 1-D even when only one ad is scored

        latency_ms = (time.monotonic() - start) * 1000
        if latency_ms > 8:  # alert if approaching SLA
            print(f"[CTRServing] WARNING: latency {latency_ms:.1f}ms")

        return predictions

:::danger Feedback Loops in Ad Auctions

If a model predicts higher CTR for an ad, that ad wins more auctions, receives more impressions, and accumulates more clicks - which the model interprets as confirmation that the CTR is high. This feedback loop can cause winning ads to have inflated predicted CTRs and losing ads to never accumulate enough data to improve their estimates.

Solution: exploration (Thompson Sampling or epsilon-greedy) forces some traffic to ads that would otherwise lose the auction. Counterfactual evaluation: compute what the ad's CTR would have been if shown in a random position, using inverse propensity weighting to remove position bias from CTR estimates. Without this, the model's training data is biased toward currently winning ads, and the model perpetually over-rates them.

:::
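
Inverse propensity weighting can be sketched in a few lines. The example assumes a simple position-based model in which a click requires the slot to be examined, and that the examination probability per slot is already known; the propensity values and the log are made up for illustration.

```python
def ipw_ctr(impressions, examine_propensity):
    """Position-debiased CTR: weight each click by 1 / P(examined | position)."""
    weighted_clicks = sum(
        clicked / examine_propensity[position]
        for position, clicked in impressions
    )
    return weighted_clicks / len(impressions)


# Hypothetical propensities: slot 0 is always seen, slot 1 half the time.
propensity = {0: 1.0, 1: 0.5}
# (position, clicked) log for one ad: 10/100 clicks in slot 0, 5/100 in slot 1.
log = [(0, 1)] * 10 + [(0, 0)] * 90 + [(1, 1)] * 5 + [(1, 0)] * 95

naive_ctr = sum(clicked for _, clicked in log) / len(log)
debiased_ctr = ipw_ctr(log, propensity)
```

The naive estimate (7.5%) understates the ad because half its impressions were in a slot that is seen only half the time; the weighted estimate recovers the 10% click rate the ad shows when examined.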

:::warning Calibration Failure After Model Updates

After retraining or fine-tuning the CTR model, the calibration almost always deteriorates - the raw model scores no longer map correctly to true CTR. This happens because the training distribution changes, the feature space changes, or the model architecture changes. Uncalibrated CTR predictions cause the auction to misprice ad inventory: if the model predicts 5% CTR when actual is 2%, advertisers overpay for impressions, which reduces advertiser ROI and long-term demand.

Always retrain the calibrator after any model update. Run the calibration check (ECE measurement) as a mandatory deployment gate - if ECE exceeds 2%, do not deploy the model until calibration is fixed.

:::
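
A deployment gate of this kind could look like the sketch below, which checks the relative gap between mean predicted CTR and observed CTR in each score decile against a 2% tolerance. The function names and the synthetic data are assumptions for illustration; a real gate would run on a fresh holdout log.

```python
def decile_calibration_gaps(y_true, y_pred, n_bins=10):
    """Relative gap |mean predicted - actual| / actual for each score decile."""
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i])
    gaps = []
    for b in range(n_bins):
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        idx = order[lo:hi]
        if not idx:
            continue
        predicted = sum(y_pred[i] for i in idx) / len(idx)
        actual = sum(y_true[i] for i in idx) / len(idx)
        # A decile with zero observed clicks has no defined relative gap.
        gaps.append(abs(predicted - actual) / actual if actual > 0 else float("inf"))
    return gaps


def passes_calibration_gate(y_true, y_pred, tolerance=0.02):
    return all(g <= tolerance for g in decile_calibration_gaps(y_true, y_pred))


# Synthetic log: decile d has predicted CTR (d+1)% and exactly (d+1) clicks per 100.
well_calibrated_pred, clicks = [], []
for decile in range(10):
    p = 0.01 * (decile + 1)
    well_calibrated_pred += [p] * 100
    clicks += [1] * (decile + 1) + [0] * (99 - decile)
inflated_pred = [2.5 * p for p in well_calibrated_pred]
```

Calling `passes_calibration_gate(clicks, well_calibrated_pred)` accepts the first model, while the inflated predictions are rejected because every decile is off by 150%.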

Interview Q&A

Q1: Explain the Wide and Deep architecture and why it works well for CTR prediction.

Wide and Deep (Cheng et al., Google 2016) combines a linear "wide" component with a neural "deep" component. The wide component is a linear model on manually engineered cross-product features (e.g., user_age_group AND ad_category). It memorizes specific patterns from training data - it learns that 18-24 year olds clicking on gaming ads is a strong (user segment, ad category) interaction. The weakness: it only memorizes patterns seen in training; it cannot generalize to new combinations.

The deep component is a multi-layer neural network on embeddings of all categorical features. It generalizes to unseen combinations by learning dense representations of users and ads. The weakness: deep models can fail to capture specific memorized patterns as precisely as a linear model.

By adding the wide and deep logits before applying sigmoid, you get both memorization (from wide) and generalization (from deep). In practice, the deep component is more important for freshness and generalization; the wide component provides a safety net for well-established behavioral patterns. DeepFM (Guo et al., 2017) replaces the wide linear model with a Factorization Machine, which automatically learns all pairwise feature interactions without manual engineering - usually better in practice.

Q2: What is CTR calibration and why does it matter for ad auctions?

CTR calibration means that when the model predicts P(click) = 5%, the actual fraction of impressions at that score that result in a click is approximately 5%. A calibrated model's scores are meaningful probabilities, not just rankings.

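Calibration matters for ad auctions because the auction uses predicted CTR both to rank ads (by bid × predicted CTR, the expected value per impression) and to set prices. In a generalized second-price auction, the winner's cost per click is roughly the next ad's bid × predicted CTR divided by the winner's own predicted CTR. If the model predicts CTR = 5% but actual CTR = 2%, the ad's expected value per impression is overstated by 2.5x: it wins slots it should lose, and an advertiser who pays per impression ends up paying 2.5x more per actual click than modeled. Advertisers notice this (their actual cost per conversion is 2.5x higher than modeled), reduce bids, and demand decreases. Over time, uncalibrated CTR causes advertiser ROI to deteriorate and platform revenue to decline.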
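
One common pricing rule, the generalized second-price auction with CTR weighting, makes the effect of miscalibration concrete. The sketch below is a toy version under that assumption: ads are ranked by bid × predicted CTR, and each ad pays per click the smallest amount that would have kept its slot. The ad names, bids, and CTRs are invented.

```python
def gsp_prices(ads):
    """ads: list of (ad_id, bid_per_click, predicted_ctr).
    Rank by bid * pCTR; each ad above the last pays per click the next
    ad's rank score divided by its own pCTR."""
    ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
    prices = {}
    for (ad_id, _bid, pctr), nxt in zip(ranked, ranked[1:]):
        prices[ad_id] = nxt[1] * nxt[2] / pctr
    return [a[0] for a in ranked], prices


calibrated = [("A", 2.00, 0.02), ("B", 1.00, 0.05), ("C", 0.50, 0.04)]
# Same auction, but A's CTR is over-predicted at 5% instead of 2%.
inflated = [("A", 2.00, 0.05), ("B", 1.00, 0.05), ("C", 0.50, 0.04)]

order_ok, prices_ok = gsp_prices(calibrated)
order_bad, prices_bad = gsp_prices(inflated)
```

In this toy auction the over-predicted ad A takes the top slot from B. At A's true 2% CTR the slot earns about $0.02 per impression, instead of the roughly $0.04 that B would have produced at its 5% CTR and $0.80 price.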

Measure calibration with Expected Calibration Error (ECE): split predictions into deciles, compute the mean predicted CTR and actual CTR in each decile, and calculate the weighted average absolute difference. Judge the result relative to the base CTR: the 2% target means predicted CTR within 2% of actual CTR, because an absolute gap of 0.02 would be a 100% error at a 2% base rate. Fix calibration with Platt scaling (logistic regression on model output) or isotonic regression (more flexible, non-parametric).

Q3: How does online learning work for CTR models and what are its risks?

Online learning continuously updates model parameters as new click feedback arrives, rather than waiting for a daily or weekly batch retraining cycle. For ads, this is critical: new ads arrive constantly, user interests shift daily, and a daily training cycle means the model is always at least 24 hours behind.

Implementation: clicks and non-clicks stream from the ad serving system to a Kafka topic. A consumer processes these events into mini-batches (256-512 events) and runs SGD updates on the CTR model every 5-10 seconds. Updated parameters are pushed to the serving layer (model version increment) every few minutes.

The risks: (1) Training instability - small batches have high gradient variance; use momentum and a small learning rate; (2) Catastrophic forgetting - aggressive online updates on recent data can cause the model to forget patterns from older data; mitigate with learning rate decay and periodic full retraining on historical data; (3) Feedback loop amplification - if the serving model influences what data is collected, online learning can rapidly amplify biases; monitor for CTR distribution shifts and compare against a shadow model trained on fresh batch data.

Q4: How do you handle the cold start problem for new ads?

New ads have no historical CTR data. The CTR model defaults to a category-based prior (e.g., average CTR for gaming ads is 1.8%). This prior is often wrong by 5-10x for any specific ad.

The solution is a bandit approach. Treat each ad as an arm in a multi-armed bandit. Use Thompson Sampling: maintain a Beta distribution over each ad's CTR (starting at Beta(1,1) = uniform). At each auction, sample a CTR from each ad's distribution. Ads with high uncertainty (Beta(1,1) has high variance) get a chance of being selected even if their mean CTR estimate is mediocre.

As the ad accumulates impressions and clicks, the Beta distribution tightens. After 100 impressions the estimate is still rough: at a true CTR of 2%, the standard error is about 1.4 percentage points. After 1,000 impressions it falls to about 0.4 points, at which point the ad can be treated like any established ad and the bandit exploration phase can end. The exploration cost (revenue lost by showing a potentially worse ad) is recovered when you discover that the new ad has high CTR and can now win high-value auctions.
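
To put numbers on how the Beta posterior narrows, the sketch below computes its standard deviation at a few impression counts for a hypothetical ad whose observed CTR is exactly 2% at each checkpoint, starting from the Beta(1, 1) prior used above.

```python
import math


def beta_posterior_std(clicks: int, impressions: int) -> float:
    """Standard deviation of Beta(1 + clicks, 1 + non-clicks)."""
    alpha = 1.0 + clicks
    beta = 1.0 + (impressions - clicks)
    total = alpha + beta
    return math.sqrt(alpha * beta / (total ** 2 * (total + 1)))


# Hypothetical ad whose observed CTR is exactly 2% at each checkpoint.
std_100 = beta_posterior_std(clicks=2, impressions=100)
std_1000 = beta_posterior_std(clicks=20, impressions=1000)
std_10000 = beta_posterior_std(clicks=200, impressions=10_000)
```

The posterior standard deviation is roughly 1.7 percentage points after 100 impressions, 0.45 after 1,000, and 0.14 after 10,000, so at a 2% base rate the early estimate is almost as uncertain as the CTR itself.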

Q5: How does Google serve CTR predictions for 100,000 queries per second?

At 100,000 QPS with 50 ads per query, the CTR serving system scores 5 million (user, ad) pairs per second. The key engineering choices:

Model architecture: the production CTR model is intentionally shallow. Google's published Wide and Deep architecture (2016) uses three ReLU hidden layers of 1024, 512, and 256 units. Depth is sacrificed for speed. With ONNX Runtime or TensorFlow Serving and quantized INT8 inference, a forward pass for a model of this size can run in tens to hundreds of microseconds on CPU, depending on hardware and batch size.

Batching: all 50 ads for a query are scored in one batched forward pass (batch_size=50), not 50 individual calls. The batch dimension parallelizes across CPU cores efficiently.

Feature pre-computation: user features (embedding, behavioral history) are computed once per query and reused for all 50 ad scorings. Only ad-specific features and cross features are computed per (user, ad) pair.
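
The batching and pre-computation points can be sketched together: compute the user-side features once, attach them to each ad's pair features, and make a single model call. The function names and the stand-in model below are hypothetical; the call counter is only there to make the access pattern visible.

```python
def score_request(user, ads, user_features_fn, pair_features_fn, model_fn):
    """Compute user features once, then build one row per ad and score the
    whole batch with a single model call."""
    user_vec = user_features_fn(user)                              # once per request
    rows = [user_vec + pair_features_fn(user, ad) for ad in ads]   # once per ad
    return model_fn(rows)                                          # one batched pass


calls = {"user": 0, "pair": 0, "model": 0}


def fake_user_features(user):
    calls["user"] += 1
    return [user["ctr_7d"]]


def fake_pair_features(user, ad):
    calls["pair"] += 1
    return [ad["historical_ctr"]]


def fake_model(rows):
    calls["model"] += 1
    return [0.5 * (u + a) for u, a in rows]


ads = [{"historical_ctr": 0.01 * i} for i in range(50)]
scores = score_request({"ctr_7d": 0.02}, ads, fake_user_features,
                       fake_pair_features, fake_model)
```

For 50 ads this makes one user-feature call, 50 pair-feature calls, and one model call. By contrast, `build_feature_vector` as written earlier re-extracts the user features for every ad.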

Infrastructure: Google uses custom hardware (TPUs) and proprietary model serving infrastructure. At their scale, the serving system is a distributed fleet of servers behind a load balancer, with per-datacenter serving to minimize network latency. The serving latency target at Google is rumored to be under 5ms for the full CTR scoring pass.

Summary

Ad CTR prediction is a revenue-critical ML system requiring extreme accuracy, speed, and calibration. The Wide and Deep architecture combines memorization of specific patterns (wide linear model on cross features) with generalization to new combinations (deep MLP on embeddings). Feature engineering - especially cross features between user interests and ad topics - contributes more to model quality than architecture complexity. Online learning updates the model with new click data every few seconds, keeping it fresh as new ads and user behaviors emerge. Calibration ensures predicted CTR matches actual CTR so the auction can price inventory correctly. The exploration problem (new ads need impressions to learn their CTR) is solved with Thompson Sampling, which naturally explores uncertain ads. Serving under 10ms requires ONNX inference, batched forward passes, pre-computed user features, and shallow architectures optimized for CPU throughput.
