What is Bayesian optimization?

How Bayesian Optimisation uses Gaussian Processes and acquisition functions to find near-optimal hyperparameters in far fewer evaluations than grid or random search - with full Python implementation using BoTorch and Optuna.

How does Gaussian process surrogate work in practice?

Bayesian Optimisation - Efficient Hyperparameter Search and Black-Box Optimization covers Bayesian optimization, Gaussian process surrogate, expected improvement from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/bayesian-ml/bayesian-optimisation

What is the difference between Bayesian optimization and expected improvement?

See the full breakdown at https://engineersofai.com/docs/ml/bayesian-ml/bayesian-optimisation

Bayesian Optimisation - Efficient Hyperparameter Search and Black-Box Optimization

Reading time: 50 min | Interview relevance: Very High - standard knowledge for ML Engineer and Applied Scientist roles; BO is the engine behind hyperparameter tuning at Google Vizier, Optuna, Ray Tune | Target roles: ML Engineer, Applied Scientist, MLOps Engineer, AI Researcher

40 Hyperparameters, 100 Evaluations

A team at DeepMind has built a new reinforcement learning agent for a complex strategic game. The agent uses a policy network with 7 layers, a value network, a curiosity module, and a replay buffer. Together, the system has 40 hyperparameters that need tuning: learning rate, optimizer beta values, entropy regularization coefficient, replay buffer size, n-step returns, priority exponent, batch size, layer widths (7 of them), dropout rates (7 of them), gradient clipping value, discount factor, target network update frequency, and more.

Grid search is obviously infeasible. Even with 2 values per hyperparameter, it would require $2^{40} = 10^{12}$ evaluations. Each evaluation trains the agent for 48 hours on 8 GPUs. At $40 a GPU-hour, one full grid search would cost more than the GDP of a small country. Random search (Bergstra & Bengio 2012) is better - it covers the space more efficiently and empirically performs well when only a few hyperparameters matter - but it ignores information from previous evaluations. After 50 evaluations, random search still picks the next point randomly, wasting the knowledge that certain combinations consistently perform poorly.

Bayesian Optimisation (BO) treats this as a sequential decision problem. After each evaluation, it updates a probabilistic model (the surrogate) of the objective function, uses that model to identify the most promising next point to evaluate (the acquisition function), and iterates. In practice, BO finds near-optimal configurations in 50–200 evaluations for most ML hyperparameter problems, regardless of dimensionality up to about 20–50 parameters. This lesson explains how.

The Black-Box Optimization Problem

The formal setup: we want to maximize (or minimize) a function $f : \mathcal{X} \to \mathbb{R}$ where:

$\mathcal{X}$ is the hyperparameter space (a bounded subset of $\mathbb{R}^d$ or a mixed continuous/discrete space)
$f(x)$ is the objective (e.g., validation accuracy after training with hyperparameters $x$ )
Evaluating $f(x)$ is expensive - each evaluation takes minutes to hours
We have no access to gradients of $f$ (it is a black box - the output is a scalar, not differentiable with respect to hyperparameters)
We can query $f$ at any point we choose, but budget allows only $T$ total evaluations

$x^* = \arg\max_{x \in \mathcal{X}} f(x)$

The black-box property distinguishes this from standard optimization. If $f$ were differentiable, gradient ascent would work. If $f$ were cheap, random search would work. The expensive + black-box combination is exactly where BO excels.

The BO Loop

Bayesian Optimisation proceeds iteratively:

Initialize: Evaluate $f$ at $n_0$ initial points (typically $n_0 = 5$ random points or a Latin hypercube design)
Fit surrogate: Given all observations $\{(x_1, f_1), \ldots, (x_t, f_t)\}$ , fit a probabilistic surrogate model $\hat{f}(x)$ that provides posterior mean $\mu(x)$ and posterior variance $\sigma^2(x)$
Maximize acquisition: Choose the next evaluation point by maximizing an acquisition function $\alpha(x)$ that trades off predicted performance and uncertainty: $x_{t+1} = \arg\max_{x \in \mathcal{X}} \alpha(x; \mu, \sigma)$
Evaluate: Run $f(x_{t+1})$ (train and evaluate the model)
Update: Add $(x_{t+1}, f_{t+1})$ to the dataset, return to step 2

Gaussian Process as Surrogate

The Gaussian Process (GP) is the canonical surrogate model for BO because it provides principled uncertainty quantification via the posterior predictive distribution.

A GP defines a distribution over functions:

$f(x) \sim \mathcal{GP}(\mu_0(x), k(x, x'))$

where $\mu_0(x)$ is the prior mean function (often set to zero) and $k(x, x')$ is the covariance (kernel) function.

Given observations $\mathbf{f} = (f_1, \ldots, f_t)^T$ at inputs $X = (x_1, \ldots, x_t)^T$ , the posterior distribution at a new point $x^*$ is Gaussian:

$f(x^*) | X, \mathbf{f} \sim \mathcal{N}(\mu(x^*), \sigma^2(x^*))$

with posterior mean and variance:

$\mu(x^*) = k(x^*, X)\left[K + \sigma_n^2 I\right]^{-1} \mathbf{f}$

$\sigma^2(x^*) = k(x^*, x^*) - k(x^*, X)\left[K + \sigma_n^2 I\right]^{-1} k(X, x^*)$

where $K_{ij} = k(x_i, x_j)$ is the kernel matrix and $\sigma_n^2$ is the observation noise variance.

The GP gives us exactly what BO needs: at any unobserved point $x^*$ , we get a predicted mean (our best estimate of $f(x^*)$ ) and a predicted variance (our uncertainty about $f(x^*)$ ). Points near observed data have low variance; points far from all observations have high variance (uncertainty).

Kernel choice: The Matérn-5/2 kernel is the standard choice for hyperparameter optimization:

$k_{\text{Matérn}}(r) = \left(1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2}\right)\exp\left(-\frac{\sqrt{5}r}{\ell}\right)$

where $r = \|x - x'\|$ and $\ell$ is the length scale. Matérn-5/2 assumes functions that are twice differentiable - a reasonable assumption for smooth objective surfaces like validation accuracy as a function of learning rate and regularization.

GP computational complexity: Fitting the GP requires inverting the $t \times t$ kernel matrix: $O(t^3)$ . After $t = 200$ observations, this is $200^3 = 8 \times 10^6$ operations - fast in practice. But it becomes a bottleneck at $t > 1000$ . For large budgets, use sparse GP approximations (inducing points) or switch to random forest surrogates (used in SMAC3).

Acquisition Functions

The acquisition function $\alpha(x)$ is the function you maximize to choose the next evaluation point. It defines the exploration-exploitation tradeoff: points with high predicted mean (exploitation - likely good) vs. high uncertainty (exploration - might find better regions).

Expected Improvement (EI)

The most widely used acquisition function. EI measures the expected amount by which the next observation will exceed the current best $f^+ = \max_i f_i$ :

$\alpha_{\text{EI}}(x) = \mathbb{E}\left[\max(f(x) - f^+, 0)\right]$

Under the GP posterior, $f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x))$ , so this integral has a closed form:

$\alpha_{\text{EI}}(x) = (\mu(x) - f^+)\Phi(Z) + \sigma(x)\phi(Z)$

where:

$Z = \frac{\mu(x) - f^+}{\sigma(x)}$

$\Phi$ is the standard normal CDF, and $\phi$ is the standard normal PDF.

Intuition: The first term $(\mu(x) - f^+)\Phi(Z)$ rewards points where the mean prediction exceeds the current best. The second term $\sigma(x)\phi(Z)$ rewards points with high uncertainty (exploration bonus). A point with $\mu(x) = f^+$ still has positive EI if $\sigma(x) > 0$ , because the uncertainty means it might be better.

EI automatically balances exploration and exploitation without a tuning parameter (unlike UCB).

Upper Confidence Bound (UCB)

$\alpha_{\text{UCB}}(x) = \mu(x) + \kappa \sigma(x)$

where $\kappa > 0$ controls the exploration-exploitation tradeoff. High $\kappa$ : explore uncertain regions. Low $\kappa$ : exploit high-mean regions.

UCB has theoretical regret bounds (Srinivas et al. 2010): with appropriate $\kappa_t$ schedule, cumulative regret $R_T = T f^* - \sum_{t=1}^T f(x_t)$ grows sub-linearly in $T$ . In practice, $\kappa = 2$ is a common default.

Probability of Improvement (PI)

$\alpha_{\text{PI}}(x) = P(f(x) > f^+ + \xi) = \Phi\left(\frac{\mu(x) - f^+ - \xi}{\sigma(x)}\right)$

where $\xi \geq 0$ is a minimum improvement threshold. PI is more exploitative than EI - it favors points likely to be better, ignoring the magnitude of improvement. Often worse than EI in practice.

Thompson Sampling

Rather than optimizing a closed-form acquisition, Thompson Sampling draws a sample $\tilde{f} \sim \text{GP posterior}$ and optimizes the sample:

$x_{t+1} = \arg\max_{x \in \mathcal{X}} \tilde{f}(x)$

Advantage: naturally parallelizable. Draw $q$ samples, optimize each independently, evaluate all $q$ points in parallel. This is the basis for batch BO.

Limitation: optimizing a GP sample over a continuous space is itself a hard optimization problem, typically solved by inner-loop gradient ascent with multiple restarts.

Exploration-Exploitation Tradeoff

Full Python Implementation

import numpy as np
from typing import Callable, Dict, List, Optional, Tuple
from scipy.stats import norm
from scipy.optimize import minimize
from scipy.spatial.distance import cdist
import warnings
warnings.filterwarnings("ignore")


# ─── GAUSSIAN PROCESS ────────────────────────────────────────────────────────

class GaussianProcess:
    """
    Gaussian Process with Matérn-5/2 kernel.
    Used as the surrogate model in Bayesian Optimisation.
    """

    def __init__(
        self,
        length_scale: float = 1.0,
        signal_var: float = 1.0,
        noise_var: float = 1e-4,
        optimize_hyperparams: bool = True,
    ):
        self.length_scale = length_scale
        self.signal_var = signal_var
        self.noise_var = noise_var
        self.optimize_hyperparams = optimize_hyperparams
        self._X_train: Optional[np.ndarray] = None
        self._y_train: Optional[np.ndarray] = None
        self._K_inv: Optional[np.ndarray] = None
        self._alpha: Optional[np.ndarray] = None

    def _matern52(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Matérn-5/2 kernel: k(x, x') = sigma^2 (1 + sqrt(5)r/l + 5r^2/3l^2) exp(-sqrt(5)r/l)"""
        # Euclidean distances
        dists = cdist(X1, X2, metric="euclidean")
        r = dists / self.length_scale
        sqrt5_r = np.sqrt(5) * r
        return self.signal_var * (1 + sqrt5_r + 5 * r**2 / 3) * np.exp(-sqrt5_r)

    def fit(self, X: np.ndarray, y: np.ndarray) -> "GaussianProcess":
        """Fit GP to observations (X, y)."""
        self._X_train = X.copy()
        self._y_train = (y - y.mean()) / (y.std() + 1e-8)   # normalize targets

        K = self._matern52(X, X) + self.noise_var * np.eye(len(X))
        try:
            self._K_inv = np.linalg.inv(K)
        except np.linalg.LinAlgError:
            # Add jitter for numerical stability
            K += 1e-6 * np.eye(len(X))
            self._K_inv = np.linalg.inv(K)

        self._alpha = self._K_inv @ self._y_train
        return self

    def predict(
        self, X_star: np.ndarray
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        GP posterior mean and variance at new points X_star.
        Returns:
            mu: shape (n,) - posterior mean
            sigma: shape (n,) - posterior standard deviation
        """
        if self._X_train is None:
            raise RuntimeError("Call fit() first.")

        K_star = self._matern52(X_star, self._X_train)   # (n_star, n_train)
        K_star_star = self._matern52(X_star, X_star)       # (n_star, n_star)

        mu = K_star @ self._alpha
        cov = K_star_star - K_star @ self._K_inv @ K_star.T
        # Take diagonal (point-wise variance), clip for numerical stability
        sigma = np.sqrt(np.clip(np.diag(cov), 0, np.inf))

        # Denormalize
        y_mean = self._y_train.mean() if self._y_train is not None else 0.0
        y_std_orig = 1.0  # we normalized above; in full implementation, store y stats
        return mu, sigma


# ─── ACQUISITION FUNCTIONS ────────────────────────────────────────────────────

def expected_improvement(
    mu: np.ndarray,
    sigma: np.ndarray,
    f_best: float,
    xi: float = 0.01,
) -> np.ndarray:
    """
    Expected Improvement acquisition function.

    EI(x) = (mu(x) - f_best - xi) * Phi(Z) + sigma(x) * phi(Z)
    where Z = (mu(x) - f_best - xi) / sigma(x)

    xi: exploration parameter (small positive value)
    """
    imp = mu - f_best - xi
    Z = np.where(sigma > 1e-8, imp / sigma, 0.0)
    ei = imp * norm.cdf(Z) + sigma * norm.pdf(Z)
    ei = np.where(sigma > 1e-8, ei, 0.0)   # zero EI where sigma ≈ 0
    return ei


def upper_confidence_bound(
    mu: np.ndarray,
    sigma: np.ndarray,
    kappa: float = 2.0,
) -> np.ndarray:
    """
    Upper Confidence Bound acquisition function.
    UCB(x) = mu(x) + kappa * sigma(x)
    kappa > 2: more exploration; kappa < 2: more exploitation.
    """
    return mu + kappa * sigma


def probability_of_improvement(
    mu: np.ndarray,
    sigma: np.ndarray,
    f_best: float,
    xi: float = 0.01,
) -> np.ndarray:
    """
    Probability of Improvement.
    PI(x) = Phi((mu(x) - f_best - xi) / sigma(x))
    """
    Z = np.where(sigma > 1e-8, (mu - f_best - xi) / sigma, 0.0)
    return norm.cdf(Z)


# ─── BAYESIAN OPTIMISATION ────────────────────────────────────────────────────

class BayesianOptimiser:
    """
    Gaussian Process-based Bayesian Optimisation.

    Usage:
        def objective(x):
            # Train model with hyperparameters x, return validation accuracy
            return -((x[0] - 0.3)**2 + (x[1] + 0.5)**2)

        bo = BayesianOptimiser(
            objective=objective,
            bounds=[(-1, 1), (-1, 1)],
            acquisition="ei",
        )
        best_x, best_y = bo.optimize(n_iter=30, n_init=5)
    """

    def __init__(
        self,
        objective: Callable,
        bounds: List[Tuple[float, float]],
        acquisition: str = "ei",
        kappa: float = 2.0,
        xi: float = 0.01,
        n_restarts_optimizer: int = 10,
        random_seed: int = 42,
    ):
        self.objective = objective
        self.bounds = np.array(bounds)
        self.dim = len(bounds)
        self.acquisition_type = acquisition
        self.kappa = kappa
        self.xi = xi
        self.n_restarts = n_restarts_optimizer
        self.rng = np.random.default_rng(random_seed)

        self.gp = GaussianProcess()
        self.X_obs: List[np.ndarray] = []
        self.y_obs: List[float] = []

    def _sample_random(self, n: int) -> np.ndarray:
        """Sample n random points from the bounded hyperparameter space."""
        lows = self.bounds[:, 0]
        highs = self.bounds[:, 1]
        return self.rng.uniform(lows, highs, size=(n, self.dim))

    def _acquisition(self, X: np.ndarray) -> np.ndarray:
        """Evaluate the acquisition function at points X."""
        mu, sigma = self.gp.predict(X)
        f_best = max(self.y_obs)

        if self.acquisition_type == "ei":
            return expected_improvement(mu, sigma, f_best, self.xi)
        elif self.acquisition_type == "ucb":
            return upper_confidence_bound(mu, sigma, self.kappa)
        elif self.acquisition_type == "pi":
            return probability_of_improvement(mu, sigma, f_best, self.xi)
        else:
            raise ValueError(f"Unknown acquisition: {self.acquisition_type}")

    def _maximize_acquisition(self) -> np.ndarray:
        """
        Find x that maximizes the acquisition function.
        Uses multiple random restarts of L-BFGS-B to avoid local optima.
        """
        lows = self.bounds[:, 0]
        highs = self.bounds[:, 1]
        best_x = None
        best_val = -np.inf

        # Random restarts
        for _ in range(self.n_restarts):
            x0 = self.rng.uniform(lows, highs)

            def neg_acq(x):
                return -self._acquisition(x.reshape(1, -1))[0]

            result = minimize(
                neg_acq,
                x0,
                method="L-BFGS-B",
                bounds=self.bounds,
            )
            if -result.fun > best_val:
                best_val = -result.fun
                best_x = result.x

        # Also check a grid of random candidates
        candidates = self._sample_random(1000)
        acq_vals = self._acquisition(candidates)
        grid_best_idx = np.argmax(acq_vals)
        if acq_vals[grid_best_idx] > best_val:
            best_x = candidates[grid_best_idx]

        return best_x

    def optimize(
        self, n_iter: int = 30, n_init: int = 5, verbose: bool = True
    ) -> Tuple[np.ndarray, float]:
        """
        Run Bayesian Optimisation.

        n_init: number of random initial evaluations
        n_iter: total number of BO iterations (including initial)

        Returns (best_x, best_y).
        """
        # Phase 1: Random initialization
        if verbose:
            print(f"Initializing with {n_init} random evaluations...")

        X_init = self._sample_random(n_init)
        for x in X_init:
            y = self.objective(x)
            self.X_obs.append(x)
            self.y_obs.append(y)

        if verbose:
            print(f"Initial best: {max(self.y_obs):.4f}")

        # Phase 2: Bayesian Optimisation loop
        for iteration in range(n_iter - n_init):
            # Fit GP on all observations
            X_all = np.array(self.X_obs)
            y_all = np.array(self.y_obs)
            self.gp.fit(X_all, y_all)

            # Maximize acquisition to get next point
            x_next = self._maximize_acquisition()

            # Evaluate objective
            y_next = self.objective(x_next)
            self.X_obs.append(x_next)
            self.y_obs.append(y_next)

            if verbose and (iteration + 1) % 5 == 0:
                print(
                    f"Iter {n_init + iteration + 1}/{n_iter}: "
                    f"y={y_next:.4f}, best_so_far={max(self.y_obs):.4f}"
                )

        best_idx = np.argmax(self.y_obs)
        best_x = self.X_obs[best_idx]
        best_y = self.y_obs[best_idx]

        if verbose:
            print(f"\nBest found: y={best_y:.4f} at x={best_x}")

        return best_x, best_y


# ─── COMPARISON: RANDOM SEARCH vs BAYESIAN OPTIMISATION ──────────────────────

def compare_methods(n_trials: int = 5):
    """
    Compare random search vs BO on a synthetic 2D objective.
    Synthetic objective: negative Branin function (standard BO benchmark).
    """
    def branin(x):
        """
        Branin function (negated, so we maximize).
        Global maximum near (x1=-pi, x2=12.275) and (pi, 2.275) - value ≈ -0.397
        """
        x1, x2 = x[0], x[1]
        a = 1
        b = 5.1 / (4 * np.pi**2)
        c = 5 / np.pi
        r = 6
        s = 10
        t = 1 / (8 * np.pi)
        val = a*(x2 - b*x1**2 + c*x1 - r)**2 + s*(1-t)*np.cos(x1) + s
        return -val  # negate for maximization

    bounds = [(-5, 10), (0, 15)]
    n_init = 5
    n_iter = 30

    print("="*60)
    print(f"COMPARISON: Random Search vs Bayesian Optimisation")
    print(f"Objective: Branin function | Budget: {n_iter} evaluations")
    print("="*60)

    bo_bests = []
    rs_bests = []
    rng = np.random.default_rng(42)

    for trial in range(n_trials):
        seed = 42 + trial

        # Bayesian Optimisation
        bo = BayesianOptimiser(
            objective=branin,
            bounds=bounds,
            acquisition="ei",
            random_seed=seed,
        )
        _, best_bo = bo.optimize(n_iter=n_iter, n_init=n_init, verbose=False)
        bo_bests.append(best_bo)

        # Random Search
        rs_rng = np.random.default_rng(seed)
        lows = np.array([b[0] for b in bounds])
        highs = np.array([b[1] for b in bounds])
        rs_samples = rs_rng.uniform(lows, highs, size=(n_iter, len(bounds)))
        rs_values = [branin(x) for x in rs_samples]
        rs_bests.append(max(rs_values))

    print(f"\nRandom Search - best found: {np.mean(rs_bests):.4f} ± {np.std(rs_bests):.4f}")
    print(f"Bayesian Opt  - best found: {np.mean(bo_bests):.4f} ± {np.std(bo_bests):.4f}")
    print(f"BO improvement: {np.mean(bo_bests) - np.mean(rs_bests):.4f}")
    print(f"(Branin global optimum: -0.397)")


# ─── BOTORCH / OPTUNA INTEGRATION ─────────────────────────────────────────────

OPTUNA_EXAMPLE = '''
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    """
    Optuna objective: suggest hyperparameters, train model, return validation loss.
    Optuna uses Tree Parzen Estimator (TPE) by default - a BO variant.
    """
    import torch
    import torch.nn as nn

    # Suggest hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 4)
    hidden_size = trial.suggest_int("hidden_size", 32, 512, step=32)

    # Build model
    layers = []
    in_size = 784  # e.g., MNIST
    for _ in range(n_layers):
        layers += [nn.Linear(in_size, hidden_size), nn.ReLU(), nn.Dropout(dropout)]
        in_size = hidden_size
    layers.append(nn.Linear(in_size, 10))
    model = nn.Sequential(*layers)

    # Train (abbreviated - in practice, use full training loop)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    # ... training loop ...

    # Return validation accuracy (Optuna maximizes by default with direction="maximize")
    val_accuracy = 0.85  # placeholder
    return val_accuracy


# Run optimization with 100 trials
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),
    pruner=optuna.pruners.HyperbandPruner(),  # early stopping of bad trials
)
study.optimize(objective, n_trials=100, n_jobs=4)  # parallel on 4 CPUs

print(f"Best trial: {study.best_trial.params}")
print(f"Best value: {study.best_value:.4f}")

# Optuna dashboard:
# optuna-dashboard sqlite:///optuna.db
'''

BOTORCH_EXAMPLE = '''
# BoTorch: production Bayesian Optimisation from Meta
# pip install botorch

import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

def run_botorch_bo(objective_fn, bounds, n_init=5, n_iter=50):
    """
    BoTorch-based BO with SingleTaskGP and Expected Improvement.
    BoTorch handles:
    - GP fitting with exact MLL optimization
    - Batch acquisition (q-EI) for parallel evaluations
    - GPU acceleration (move tensors to CUDA)
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Normalize bounds to [0, 1]
    bounds_tensor = torch.tensor(bounds, dtype=torch.float64, device=device).T

    # Initialize with random points
    train_X = torch.rand(n_init, len(bounds), dtype=torch.float64, device=device)
    train_Y = torch.tensor(
        [[objective_fn(x.numpy())] for x in train_X],
        dtype=torch.float64, device=device
    )

    for iteration in range(n_iter):
        # Fit GP
        gp = SingleTaskGP(train_X, train_Y)
        mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
        fit_gpytorch_mll(mll)

        # Build EI acquisition
        best_f = train_Y.max().item()
        EI = ExpectedImprovement(model=gp, best_f=best_f, maximize=True)

        # Optimize EI over unit cube (we normalized bounds)
        unit_bounds = torch.stack([
            torch.zeros(len(bounds), dtype=torch.float64),
            torch.ones(len(bounds), dtype=torch.float64)
        ]).to(device)

        candidate, acq_value = optimize_acqf(
            EI,
            bounds=unit_bounds,
            q=1,           # q=1 for sequential BO; q>1 for batch BO
            num_restarts=10,
            raw_samples=100,
        )

        # Evaluate
        y_new = torch.tensor(
            [[objective_fn(candidate[0].detach().numpy())]],
            dtype=torch.float64, device=device
        )

        # Update dataset
        train_X = torch.cat([train_X, candidate], dim=0)
        train_Y = torch.cat([train_Y, y_new], dim=0)

        if (iteration + 1) % 10 == 0:
            print(f"Iter {n_init + iteration + 1}: best_y={train_Y.max().item():.4f}")

    best_idx = train_Y.argmax().item()
    return train_X[best_idx].detach().numpy(), train_Y.max().item()
'''


# ─── MULTI-FIDELITY BO (BOHB) ─────────────────────────────────────────────────

BOHB_DESCRIPTION = """
BOHB: Bayesian Optimization with HyperBand (Falkner et al. 2018)

Multi-fidelity BO uses cheap, low-fidelity evaluations (train for 1 epoch instead of 100)
to quickly filter out bad configurations, then uses expensive high-fidelity evaluations
only for the promising candidates.

BOHB combines:
1. HyperBand: successive halving schedule - start with many cheap configs,
   progressively eliminate bad ones, train survivors longer
2. Bayesian Optimisation: use a KDE-based model (not GP) to propose new configs
   that look promising based on high-performing observations

Algorithm sketch:
- Round 1: Evaluate 81 configs for 1 epoch each. Keep top 27.
- Round 2: Train those 27 for 3 epochs. Keep top 9.
- Round 3: Train those 9 for 9 epochs. Keep top 3.
- Round 4: Train those 3 for 27 epochs. Keep top 1.
- Round 5: Train final winner for 81 epochs.

Total compute: 81*1 + 27*3 + 9*9 + 3*27 + 1*81 = 81+81+81+81+81 = 405 epoch-equivalents
vs. random search at full fidelity: 1 config × 81 epochs = 81 epoch-equiv per config,
so same budget as only 5 random search trials!

Use BOHB via SMAC3 or Ray Tune:

from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB

search_alg = TuneBOHB()
scheduler = HyperBandForBOHB(
    time_attr="training_iteration",
    max_t=81,
    reduction_factor=3,
)
tuner = tune.Tuner(
    train_fn,
    tune_config=tune.TuneConfig(
        search_alg=search_alg,
        scheduler=scheduler,
        num_samples=50,
    ),
)
results = tuner.fit()
"""


def run_demo():
    """Run the comparison demo."""
    compare_methods(n_trials=3)

    # Also show a simple 1D BO illustration
    print("\n--- 1D BO Illustration ---")
    def simple_1d(x):
        """Multi-modal 1D function: f(x) = sin(3x) + x^2 - 0.7*sin(x)"""
        return float(np.sin(3*x[0]) + x[0]**2 - 0.7*np.sin(x[0]))

    bo_1d = BayesianOptimiser(
        objective=lambda x: -simple_1d(x),   # maximize the negative
        bounds=[(-1.5, 1.5)],
        acquisition="ei",
        random_seed=42,
    )
    best_x, best_y = bo_1d.optimize(n_iter=20, n_init=3, verbose=True)
    print(f"\n1D result: best_x={best_x[0]:.4f}, best_y={-best_y:.4f}")


if __name__ == "__main__":
    run_demo()

Batch BO: Evaluating Multiple Points in Parallel

Sequential BO (one evaluation at a time) is inefficient when you have multiple GPUs or compute nodes available. Batch BO extends BO to select $q$ points at each iteration, all evaluated in parallel.

q-Expected Improvement (q-EI): The joint expected improvement of evaluating $q$ points simultaneously:

$\alpha_{q\text{-EI}}(X) = \mathbb{E}\left[\max_{i=1}^{q} \max(f(x_i) - f^+, 0)\right]$

This integral has no closed form for $q > 1$ , but BoTorch approximates it via Monte Carlo sampling using the reparameterization trick, making it differentiable and optimizable via gradient ascent.

Fantasy-based approaches: Another strategy is "fantasizing" - for pending evaluations, sample the GP posterior to generate a "fantasy" observation, condition the GP on that, and optimize the next acquisition. Repeat $q$ times to get $q$ next points. Each point is chosen accounting for the information provided by the previous selections.

In practice, batch sizes of $q = 4$ – $16$ are common for neural network hyperparameter optimization, corresponding to training multiple model variants in parallel on a cluster.

Trust Region BO (TuRBO) for High Dimensions

Standard GP-based BO struggles in high-dimensional spaces ( $d > 20$ ) because:

GP kernel matrices become near-singular with high-dimensional inputs
The hyperparameter space grows exponentially, making acquisition optimization harder
Many dimensions are irrelevant (low effective dimensionality)

TuRBO (Eriksson et al. 2019) addresses this with local search. Instead of a single global GP, TuRBO maintains a set of trust regions - small boxes in hyperparameter space where the GP is a reliable local model.

TuRBO algorithm:

Initialize one trust region centered at the best observed point, with side length $L$
Fit a GP within the trust region
Optimize Thompson Sampling acquisition within the trust region
Expand the trust region if consecutive successes; shrink on consecutive failures
Restart with new trust region if it becomes too small

TuRBO achieves state-of-the-art performance on problems with $d = 40$ – $100$ hyperparameters, where global GP-based BO fails.

Platform Comparison

Platform	Surrogate	Acquisition	Multi-Fidelity	GPU	Best For
Google Vizier	GP (exact + sparse)	EI, UCB, Thompson	Yes (ASHA)	Yes	Large-scale production tuning at Google
Optuna	TPE (Tree Parzen Estimator)	EI-like	Yes (Hyperband)	No	General purpose, easy API, fast
BoTorch	GP (exact, batched)	q-EI, q-UCB, TS	Via custom	Yes	Research, custom acquisition functions
Ray Tune	Multiple (BoTorch, Optuna)	Configurable	Yes (ASHA, BOHB)	Yes	Distributed, multi-GPU clusters
SMAC3	Random Forest surrogate	EI	Yes (BOHB)	No	NAS, large categorical spaces
Ax (Meta)	GP (BoTorch backend)	EI, qNEI	Yes	Yes	Production A/B, ML experiments at Meta

Recommendation: Use Optuna for most ML hyperparameter tuning - it is fast, easy to integrate with any framework, and supports Hyperband pruning out of the box. Use BoTorch for research problems requiring custom acquisition functions or batch BO. Use Ray Tune for distributed tuning across a GPU cluster.

BO for Neural Architecture Search (NAS)

NAS (Neural Architecture Search) is the problem of finding the best neural network architecture for a task. The search space includes discrete choices: number of layers, layer types (conv, attention, MLP), kernel sizes, skip connections.

DARTS (Differentiable Architecture Search): Relaxes discrete architecture choices to continuous weights, making the search end-to-end differentiable. Optimizes architecture and weights jointly via gradient descent. Fast but prone to degeneracy (collapsing to skip-connection-dominated architectures).

BO-based NAS: Treat architecture as a discrete hyperparameter space and apply BO. The surrogate must handle discrete inputs - random forest surrogates (SMAC3) or GP with string kernels (designed for graph-structured architecture spaces). BANANAS (White et al. 2021) uses a neural network predictor as the surrogate for NAS, achieving competitive results.

In practice: For most ML practitioners, BO is used for training hyperparameters (learning rate, regularization, batch size), not architecture search. NAS is an expensive research-level tool; BoTorch or Optuna is the practical approach for hyperparameter tuning.

Limitations of Bayesian Optimisation

Dimensionality: GP-based BO works well for $d \leq 20$ hyperparameters. Above 20, the GP kernel becomes unreliable (curse of dimensionality), and acquisition optimization becomes very hard. Use TuRBO for $d > 20$ , or embed the high-dimensional space into a low-dimensional subspace (REMBO - Random Embedding BO).

GP complexity: Exact GP inference costs $O(t^3)$ for $t$ observations. After 500 evaluations, this becomes a bottleneck. Solutions: sparse GP (inducing points), batched GP on GPU (BoTorch), or switch to random forest surrogates (SMAC3) which scale linearly.

Warm-starting: When you change the training dataset or model architecture, all previous observations are potentially invalid (different objective function). BO cannot warm-start across objective function changes without transfer learning extensions.

Noisy objectives: If training is stochastic and the same hyperparameters give different validation accuracy each time (due to random initialization), the GP model must account for observation noise. Use $\sigma_n^2 > 0$ and consider running multiple repetitions of promising configurations.

Comparison Table: Hyperparameter Search Methods

Method	Evaluations to Good Config	Parallelizable	Needs Warm-start	Best For
Grid search	$10^d$	Yes	No	1–2 hyperparameters, small grids
Random search	50–200	Yes	No	When few hyperparameters matter
Hyperband (ASHA)	20–50	Yes	No	Fast iteration with cheap early stopping
Bayesian Optimisation	50–150	Batch BO	Transfer BO	d ≤ 20, expensive evaluations
BOHB	20–100	Yes	No	Best of BO + Hyperband, most practical

Common Mistakes

:::danger Mistake 1: Using Bayesian Optimisation when random search is sufficient BO has overhead: GP fitting is $O(t^3)$ , acquisition optimization requires multiple restarts, and the initialization phase requires sequential waiting. If your model trains in 30 seconds, random search with 200 parallel evaluations on a cluster is faster and cheaper than BO. BO beats random search only when evaluations are expensive (> 5 minutes each) and you have a limited budget (< 200 evaluations). For cheap evaluations, random search with parallel execution is the right tool. :::

:::danger Mistake 2: Not scaling hyperparameters before GP fitting GP kernels measure distance in the input space. If learning rate ranges from $10^{-5}$ to $10^{-1}$ and dropout ranges from 0 to 0.5, the raw distance is dominated by the dropout term. Always log-transform bounded-positive hyperparameters (learning rate, weight decay) before BO, and normalize all hyperparameters to [0, 1] for GP fitting. In Optuna and BoTorch, use log=True for learning rates and use the bounded search space to handle normalization. :::

:::warning Mistake 3: Treating the best observed hyperparameter as the global optimum BO finds a near-optimal configuration within the budget, but it is not guaranteed to find the global optimum. After BO, re-evaluate the best configuration 3–5 times (with different random seeds) to estimate variance in the result. The configuration that looked best in one training run may have benefited from a lucky random seed. Report mean ± std over multiple runs, not the single best observation. :::

:::warning Mistake 4: Ignoring hyperparameter interactions when setting bounds BO works within the bounds you specify. If the optimal learning rate depends on batch size (which it does - effective learning rate scales with batch size), setting learning rate bounds independently of batch size may never visit the optimal (lr, batch_size) combination. Define the search space thoughtfully: either fix reasonable bounds that cover the interaction space, or add a derived hyperparameter (e.g., effective_lr = lr * sqrt(batch_size)). Alternatively, use BOHB's warm-starting from lower fidelity evaluations to explore the joint space more efficiently. :::

YouTube Resources

Resource	Creator	Focus
Bayesian Optimisation - A Tutorial	Peter Frazier (Cornell)	Theory: GP surrogate, EI derivation
Hyperparameter Tuning with Optuna	Pfeiffer Optuna	Practical Optuna walkthrough
BoTorch: Bayesian Optimisation in PyTorch	Meta AI	BoTorch library, batch BO, GPU acceleration
BOHB: Hyperband + Bayesian Optimisation	Frank Hutter	BOHB paper explanation and comparison
Neural Architecture Search with BO	Colin White	BANANAS: NAS with BO surrogate

Interview Q&A

Q1: Derive the Expected Improvement acquisition function. What are the two components and what do they represent?

EI is defined as $\alpha_{\text{EI}}(x) = \mathbb{E}[\max(f(x) - f^+, 0)]$ where $f^+$ is the current best observed value. Under the GP posterior, $f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x))$ . Let $Z = (\mu(x) - f^+) / \sigma(x)$ . Then $f(x) - f^+ \sim \sigma(x) \cdot (\mathcal{N}(Z, 1) - Z)$ , and the expectation of the positive part is: $\mathbb{E}[\max(f(x) - f^+, 0)] = \sigma(x) \cdot \mathbb{E}[\max(\mathcal{N}(Z, 1), 0)] = (\mu(x) - f^+)\Phi(Z) + \sigma(x)\phi(Z)$ . The first term $(\mu(x) - f^+)\Phi(Z)$ is the exploitation term: it is large when the predicted mean exceeds the current best and when the probability $\Phi(Z)$ of improvement is high. The second term $\sigma(x)\phi(Z)$ is the exploration term: it is large when uncertainty $\sigma(x)$ is high, even if the mean is below the current best. EI automatically balances exploration and exploitation without a tuning parameter.

Q2: When does Bayesian Optimisation beat random search? Give a concrete example with numbers.

Bergstra and Bengio (2012) showed theoretically and empirically that random search beats grid search because hyperparameters vary in importance - if 4 of 10 hyperparameters matter, random search explores the 4-dimensional relevant subspace densely while grid search wastes evaluations on the 6 irrelevant dimensions. BO beats random search when: (1) evaluations are expensive (>5 minutes each), (2) the objective is smooth (GP captures the structure), and (3) budget is limited (<200 evaluations). Concrete example: a neural network with 8 hyperparameters, each evaluation takes 2 hours on a GPU. Budget: 100 evaluations = 200 GPU-hours. Random search picks 100 random configurations independently. BO, after 10 random initializations, observes that high learning rate + high dropout consistently fails, and high weight decay + small batch size performs well. BO concentrates the remaining 90 evaluations near the discovered optimum region. In practice on standard benchmarks, BO finds configurations with 1–3% higher validation accuracy than random search at the same budget, when the objective is smooth and $d \leq 20$ .

Q3: What is multi-fidelity BO and when should you use BOHB over standard BO?

Multi-fidelity BO uses cheap, low-fidelity evaluations (train for 5 epochs) to screen out clearly bad configurations, reserving expensive high-fidelity evaluations (train for 100 epochs) for promising candidates. The key insight from HyperBand: early performance (5 epochs) is correlated with final performance (100 epochs) - not perfectly, but enough to eliminate the bottom 75% of configurations early. BOHB combines HyperBand's successive halving schedule with a BO-based configuration proposer (KDE model). Use BOHB over standard BO when: (1) you can train partial models (early stopping is supported), (2) the budget allows at least 50 evaluations, (3) you expect many clearly bad configurations (learning rates that are too high → immediate divergence). Standard BO is better when: (1) evaluations cannot be stopped early (e.g., full training is required before any signal), (2) $d > 20$ (BOHB also struggles at high dimensions, but no worse than BO), (3) you need to integrate custom acquisition functions (BoTorch). In practice, BOHB (via Ray Tune) is the most efficient practical method for neural network hyperparameter tuning: it reduces the effective number of expensive evaluations by 3–10x compared to standard BO.

Q4: How do you handle parallel evaluation in BO (batch BO)? What is the fantasization approach?

Standard BO is sequential: evaluate one point, update the GP, choose the next. But if you have 8 GPUs, you want to evaluate 8 hyperparameter configurations simultaneously. Batch BO selects $q$ configurations at each iteration. q-EI (the canonical batch acquisition) computes the joint expected improvement of evaluating all $q$ points simultaneously: $\alpha_{q\text{-EI}}(X) = \mathbb{E}[\max_i \max(f(x_i) - f^+, 0)]$ . This has no closed form for $q > 1$ , but BoTorch approximates it via Monte Carlo (reparameterization trick) making it differentiable and optimizable with gradient ascent. The fantasization approach is simpler to implement: to select $q$ points, select $x_1$ by standard EI, then "fantasize" (sample) the GP posterior at $x_1$ to get a fantasy observation $\tilde{f}_1$ , condition the GP on $(x_1, \tilde{f}_1)$ , and select $x_2$ from the updated GP. Repeat for $x_3, \ldots, x_q$ . This is computationally cheaper than q-EI but relies on the quality of GP posterior samples. For most applications, batch sizes of $q = 4$ – $8$ match the number of available GPUs, and Thompson Sampling (sample GP, optimize sample, repeat $q$ times) provides a simple, naturally parallel approach.

Q5: A colleague says "Bayesian Optimisation doesn't scale above 20 dimensions - just use random search for large hyperparameter spaces." How do you respond?

This is mostly correct but misses two important tools. The colleague is right that standard GP-based BO struggles above $d = 20$ : GP kernels assume a specific smoothness structure that becomes unreliable in high-dimensional spaces, acquisition optimization becomes computationally expensive, and the curse of dimensionality means the GP needs exponentially more data to model the function well. However, two extensions address this: (1) TuRBO (Trust Region BO, Eriksson 2019) uses local GP models within small trust regions around the current best point. It does not try to model the global function; instead, it does efficient local search that restarts when stuck. TuRBO achieves state-of-the-art results on 40–100 dimensional problems. (2) REMBO and ALEBO use random linear projections to embed the high-dimensional space into a low-dimensional subspace, run BO in that subspace, and project back. This works well when the objective has low effective dimensionality (only a few of the $d$ hyperparameters truly matter). In practice: for $d > 20$ with expensive evaluations, try TuRBO. For moderate $d$ (10–30) with quick evaluation, Optuna's TPE (Tree Parzen Estimator) with Hyperband pruning is often the best practical option - it implicitly handles high-dimensional spaces by modeling the distribution of good vs bad configurations with kernel density estimators rather than a global GP.

Key Takeaways

Bayesian Optimisation solves the expensive black-box optimization problem by fitting a Gaussian Process surrogate to all previous evaluations and using an acquisition function to select the most promising next evaluation point. The GP provides posterior mean (prediction) and variance (uncertainty) at any unobserved point. Expected Improvement is the canonical acquisition function - it analytically integrates the probability and magnitude of improvement over the current best. BO beats random search when evaluations are expensive and the objective is smooth, typically finding better configurations in 50–150 evaluations versus 200+ for random search. The main limitations are computational scaling of GP ( $O(t^3)$ ) and the dimensionality barrier above $d = 20$ - addressed by TuRBO for local search and BOHB for multi-fidelity evaluation. For practical ML hyperparameter tuning, Optuna with Hyperband pruning is the recommended default. For research requiring custom acquisition functions or batch BO, use BoTorch.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Gaussian Process demo on the EngineersOfAI Playground - no code required.

:::

40 Hyperparameters, 100 Evaluations​

The Black-Box Optimization Problem​

The BO Loop​

Gaussian Process as Surrogate​

Acquisition Functions​

Expected Improvement (EI)​

Upper Confidence Bound (UCB)​

Probability of Improvement (PI)​

Thompson Sampling​

Exploration-Exploitation Tradeoff​

Full Python Implementation​

Batch BO: Evaluating Multiple Points in Parallel​

Trust Region BO (TuRBO) for High Dimensions​

Platform Comparison​

BO for Neural Architecture Search (NAS)​

Limitations of Bayesian Optimisation​

Comparison Table: Hyperparameter Search Methods​

Common Mistakes​

YouTube Resources​

Interview Q&A​

Key Takeaways​