Bayesian Optimisation - Efficient Hyperparameter Search and Black-Box Optimization
Reading time: 50 min | Interview relevance: Very High - standard knowledge for ML Engineer and Applied Scientist roles; BO is the engine behind hyperparameter tuning at Google Vizier, Optuna, Ray Tune | Target roles: ML Engineer, Applied Scientist, MLOps Engineer, AI Researcher
40 Hyperparameters, 100 Evaluations
A team at DeepMind has built a new reinforcement learning agent for a complex strategic game. The agent uses a policy network with 7 layers, a value network, a curiosity module, and a replay buffer. Together, the system has 40 hyperparameters that need tuning: learning rate, optimizer beta values, entropy regularization coefficient, replay buffer size, n-step returns, priority exponent, batch size, layer widths (7 of them), dropout rates (7 of them), gradient clipping value, discount factor, target network update frequency, and more.
Grid search is obviously infeasible. Even with 2 values per hyperparameter, it would require evaluations. Each evaluation trains the agent for 48 hours on 8 GPUs. At $40 a GPU-hour, one full grid search would cost more than the GDP of a small country. Random search (Bergstra & Bengio 2012) is better - it covers the space more efficiently and empirically performs well when only a few hyperparameters matter - but it ignores information from previous evaluations. After 50 evaluations, random search still picks the next point randomly, wasting the knowledge that certain combinations consistently perform poorly.
Bayesian Optimisation (BO) treats this as a sequential decision problem. After each evaluation, it updates a probabilistic model (the surrogate) of the objective function, uses that model to identify the most promising next point to evaluate (the acquisition function), and iterates. In practice, BO finds near-optimal configurations in 50–200 evaluations for most ML hyperparameter problems, regardless of dimensionality up to about 20–50 parameters. This lesson explains how.
The Black-Box Optimization Problem
The formal setup: we want to maximize (or minimize) a function where:
- is the hyperparameter space (a bounded subset of or a mixed continuous/discrete space)
- is the objective (e.g., validation accuracy after training with hyperparameters )
- Evaluating is expensive - each evaluation takes minutes to hours
- We have no access to gradients of (it is a black box - the output is a scalar, not differentiable with respect to hyperparameters)
- We can query at any point we choose, but budget allows only total evaluations
The black-box property distinguishes this from standard optimization. If were differentiable, gradient ascent would work. If were cheap, random search would work. The expensive + black-box combination is exactly where BO excels.
The BO Loop
Bayesian Optimisation proceeds iteratively:
- Initialize: Evaluate at initial points (typically random points or a Latin hypercube design)
- Fit surrogate: Given all observations , fit a probabilistic surrogate model that provides posterior mean and posterior variance
- Maximize acquisition: Choose the next evaluation point by maximizing an acquisition function that trades off predicted performance and uncertainty:
- Evaluate: Run (train and evaluate the model)
- Update: Add to the dataset, return to step 2
Gaussian Process as Surrogate
The Gaussian Process (GP) is the canonical surrogate model for BO because it provides principled uncertainty quantification via the posterior predictive distribution.
A GP defines a distribution over functions:
where is the prior mean function (often set to zero) and is the covariance (kernel) function.
Given observations at inputs , the posterior distribution at a new point is Gaussian:
with posterior mean and variance:
where is the kernel matrix and is the observation noise variance.
The GP gives us exactly what BO needs: at any unobserved point , we get a predicted mean (our best estimate of ) and a predicted variance (our uncertainty about ). Points near observed data have low variance; points far from all observations have high variance (uncertainty).
Kernel choice: The Matérn-5/2 kernel is the standard choice for hyperparameter optimization:
where and is the length scale. Matérn-5/2 assumes functions that are twice differentiable - a reasonable assumption for smooth objective surfaces like validation accuracy as a function of learning rate and regularization.
GP computational complexity: Fitting the GP requires inverting the kernel matrix: . After observations, this is operations - fast in practice. But it becomes a bottleneck at . For large budgets, use sparse GP approximations (inducing points) or switch to random forest surrogates (used in SMAC3).
Acquisition Functions
The acquisition function is the function you maximize to choose the next evaluation point. It defines the exploration-exploitation tradeoff: points with high predicted mean (exploitation - likely good) vs. high uncertainty (exploration - might find better regions).
Expected Improvement (EI)
The most widely used acquisition function. EI measures the expected amount by which the next observation will exceed the current best :
Under the GP posterior, , so this integral has a closed form:
where:
is the standard normal CDF, and is the standard normal PDF.
Intuition: The first term rewards points where the mean prediction exceeds the current best. The second term rewards points with high uncertainty (exploration bonus). A point with still has positive EI if , because the uncertainty means it might be better.
EI automatically balances exploration and exploitation without a tuning parameter (unlike UCB).
Upper Confidence Bound (UCB)
where controls the exploration-exploitation tradeoff. High : explore uncertain regions. Low : exploit high-mean regions.
UCB has theoretical regret bounds (Srinivas et al. 2010): with appropriate schedule, cumulative regret grows sub-linearly in . In practice, is a common default.
Probability of Improvement (PI)
where is a minimum improvement threshold. PI is more exploitative than EI - it favors points likely to be better, ignoring the magnitude of improvement. Often worse than EI in practice.
Thompson Sampling
Rather than optimizing a closed-form acquisition, Thompson Sampling draws a sample and optimizes the sample:
Advantage: naturally parallelizable. Draw samples, optimize each independently, evaluate all points in parallel. This is the basis for batch BO.
Limitation: optimizing a GP sample over a continuous space is itself a hard optimization problem, typically solved by inner-loop gradient ascent with multiple restarts.
Exploration-Exploitation Tradeoff
Full Python Implementation
import numpy as np
from typing import Callable, Dict, List, Optional, Tuple
from scipy.stats import norm
from scipy.optimize import minimize
from scipy.spatial.distance import cdist
import warnings
warnings.filterwarnings("ignore")
# ─── GAUSSIAN PROCESS ────────────────────────────────────────────────────────
class GaussianProcess:
"""
Gaussian Process with Matérn-5/2 kernel.
Used as the surrogate model in Bayesian Optimisation.
"""
def __init__(
self,
length_scale: float = 1.0,
signal_var: float = 1.0,
noise_var: float = 1e-4,
optimize_hyperparams: bool = True,
):
self.length_scale = length_scale
self.signal_var = signal_var
self.noise_var = noise_var
self.optimize_hyperparams = optimize_hyperparams
self._X_train: Optional[np.ndarray] = None
self._y_train: Optional[np.ndarray] = None
self._K_inv: Optional[np.ndarray] = None
self._alpha: Optional[np.ndarray] = None
def _matern52(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
"""Matérn-5/2 kernel: k(x, x') = sigma^2 (1 + sqrt(5)r/l + 5r^2/3l^2) exp(-sqrt(5)r/l)"""
# Euclidean distances
dists = cdist(X1, X2, metric="euclidean")
r = dists / self.length_scale
sqrt5_r = np.sqrt(5) * r
return self.signal_var * (1 + sqrt5_r + 5 * r**2 / 3) * np.exp(-sqrt5_r)
def fit(self, X: np.ndarray, y: np.ndarray) -> "GaussianProcess":
"""Fit GP to observations (X, y)."""
self._X_train = X.copy()
self._y_train = (y - y.mean()) / (y.std() + 1e-8) # normalize targets
K = self._matern52(X, X) + self.noise_var * np.eye(len(X))
try:
self._K_inv = np.linalg.inv(K)
except np.linalg.LinAlgError:
# Add jitter for numerical stability
K += 1e-6 * np.eye(len(X))
self._K_inv = np.linalg.inv(K)
self._alpha = self._K_inv @ self._y_train
return self
def predict(
self, X_star: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
"""
GP posterior mean and variance at new points X_star.
Returns:
mu: shape (n,) - posterior mean
sigma: shape (n,) - posterior standard deviation
"""
if self._X_train is None:
raise RuntimeError("Call fit() first.")
K_star = self._matern52(X_star, self._X_train) # (n_star, n_train)
K_star_star = self._matern52(X_star, X_star) # (n_star, n_star)
mu = K_star @ self._alpha
cov = K_star_star - K_star @ self._K_inv @ K_star.T
# Take diagonal (point-wise variance), clip for numerical stability
sigma = np.sqrt(np.clip(np.diag(cov), 0, np.inf))
# Denormalize
y_mean = self._y_train.mean() if self._y_train is not None else 0.0
y_std_orig = 1.0 # we normalized above; in full implementation, store y stats
return mu, sigma
# ─── ACQUISITION FUNCTIONS ────────────────────────────────────────────────────
def expected_improvement(
mu: np.ndarray,
sigma: np.ndarray,
f_best: float,
xi: float = 0.01,
) -> np.ndarray:
"""
Expected Improvement acquisition function.
EI(x) = (mu(x) - f_best - xi) * Phi(Z) + sigma(x) * phi(Z)
where Z = (mu(x) - f_best - xi) / sigma(x)
xi: exploration parameter (small positive value)
"""
imp = mu - f_best - xi
Z = np.where(sigma > 1e-8, imp / sigma, 0.0)
ei = imp * norm.cdf(Z) + sigma * norm.pdf(Z)
ei = np.where(sigma > 1e-8, ei, 0.0) # zero EI where sigma ≈ 0
return ei
def upper_confidence_bound(
mu: np.ndarray,
sigma: np.ndarray,
kappa: float = 2.0,
) -> np.ndarray:
"""
Upper Confidence Bound acquisition function.
UCB(x) = mu(x) + kappa * sigma(x)
kappa > 2: more exploration; kappa < 2: more exploitation.
"""
return mu + kappa * sigma
def probability_of_improvement(
mu: np.ndarray,
sigma: np.ndarray,
f_best: float,
xi: float = 0.01,
) -> np.ndarray:
"""
Probability of Improvement.
PI(x) = Phi((mu(x) - f_best - xi) / sigma(x))
"""
Z = np.where(sigma > 1e-8, (mu - f_best - xi) / sigma, 0.0)
return norm.cdf(Z)
# ─── BAYESIAN OPTIMISATION ────────────────────────────────────────────────────
class BayesianOptimiser:
"""
Gaussian Process-based Bayesian Optimisation.
Usage:
def objective(x):
# Train model with hyperparameters x, return validation accuracy
return -((x[0] - 0.3)**2 + (x[1] + 0.5)**2)
bo = BayesianOptimiser(
objective=objective,
bounds=[(-1, 1), (-1, 1)],
acquisition="ei",
)
best_x, best_y = bo.optimize(n_iter=30, n_init=5)
"""
def __init__(
self,
objective: Callable,
bounds: List[Tuple[float, float]],
acquisition: str = "ei",
kappa: float = 2.0,
xi: float = 0.01,
n_restarts_optimizer: int = 10,
random_seed: int = 42,
):
self.objective = objective
self.bounds = np.array(bounds)
self.dim = len(bounds)
self.acquisition_type = acquisition
self.kappa = kappa
self.xi = xi
self.n_restarts = n_restarts_optimizer
self.rng = np.random.default_rng(random_seed)
self.gp = GaussianProcess()
self.X_obs: List[np.ndarray] = []
self.y_obs: List[float] = []
def _sample_random(self, n: int) -> np.ndarray:
"""Sample n random points from the bounded hyperparameter space."""
lows = self.bounds[:, 0]
highs = self.bounds[:, 1]
return self.rng.uniform(lows, highs, size=(n, self.dim))
def _acquisition(self, X: np.ndarray) -> np.ndarray:
"""Evaluate the acquisition function at points X."""
mu, sigma = self.gp.predict(X)
f_best = max(self.y_obs)
if self.acquisition_type == "ei":
return expected_improvement(mu, sigma, f_best, self.xi)
elif self.acquisition_type == "ucb":
return upper_confidence_bound(mu, sigma, self.kappa)
elif self.acquisition_type == "pi":
return probability_of_improvement(mu, sigma, f_best, self.xi)
else:
raise ValueError(f"Unknown acquisition: {self.acquisition_type}")
def _maximize_acquisition(self) -> np.ndarray:
"""
Find x that maximizes the acquisition function.
Uses multiple random restarts of L-BFGS-B to avoid local optima.
"""
lows = self.bounds[:, 0]
highs = self.bounds[:, 1]
best_x = None
best_val = -np.inf
# Random restarts
for _ in range(self.n_restarts):
x0 = self.rng.uniform(lows, highs)
def neg_acq(x):
return -self._acquisition(x.reshape(1, -1))[0]
result = minimize(
neg_acq,
x0,
method="L-BFGS-B",
bounds=self.bounds,
)
if -result.fun > best_val:
best_val = -result.fun
best_x = result.x
# Also check a grid of random candidates
candidates = self._sample_random(1000)
acq_vals = self._acquisition(candidates)
grid_best_idx = np.argmax(acq_vals)
if acq_vals[grid_best_idx] > best_val:
best_x = candidates[grid_best_idx]
return best_x
def optimize(
self, n_iter: int = 30, n_init: int = 5, verbose: bool = True
) -> Tuple[np.ndarray, float]:
"""
Run Bayesian Optimisation.
n_init: number of random initial evaluations
n_iter: total number of BO iterations (including initial)
Returns (best_x, best_y).
"""
# Phase 1: Random initialization
if verbose:
print(f"Initializing with {n_init} random evaluations...")
X_init = self._sample_random(n_init)
for x in X_init:
y = self.objective(x)
self.X_obs.append(x)
self.y_obs.append(y)
if verbose:
print(f"Initial best: {max(self.y_obs):.4f}")
# Phase 2: Bayesian Optimisation loop
for iteration in range(n_iter - n_init):
# Fit GP on all observations
X_all = np.array(self.X_obs)
y_all = np.array(self.y_obs)
self.gp.fit(X_all, y_all)
# Maximize acquisition to get next point
x_next = self._maximize_acquisition()
# Evaluate objective
y_next = self.objective(x_next)
self.X_obs.append(x_next)
self.y_obs.append(y_next)
if verbose and (iteration + 1) % 5 == 0:
print(
f"Iter {n_init + iteration + 1}/{n_iter}: "
f"y={y_next:.4f}, best_so_far={max(self.y_obs):.4f}"
)
best_idx = np.argmax(self.y_obs)
best_x = self.X_obs[best_idx]
best_y = self.y_obs[best_idx]
if verbose:
print(f"\nBest found: y={best_y:.4f} at x={best_x}")
return best_x, best_y
# ─── COMPARISON: RANDOM SEARCH vs BAYESIAN OPTIMISATION ──────────────────────
def compare_methods(n_trials: int = 5):
"""
Compare random search vs BO on a synthetic 2D objective.
Synthetic objective: negative Branin function (standard BO benchmark).
"""
def branin(x):
"""
Branin function (negated, so we maximize).
Global maximum near (x1=-pi, x2=12.275) and (pi, 2.275) - value ≈ -0.397
"""
x1, x2 = x[0], x[1]
a = 1
b = 5.1 / (4 * np.pi**2)
c = 5 / np.pi
r = 6
s = 10
t = 1 / (8 * np.pi)
val = a*(x2 - b*x1**2 + c*x1 - r)**2 + s*(1-t)*np.cos(x1) + s
return -val # negate for maximization
bounds = [(-5, 10), (0, 15)]
n_init = 5
n_iter = 30
print("="*60)
print(f"COMPARISON: Random Search vs Bayesian Optimisation")
print(f"Objective: Branin function | Budget: {n_iter} evaluations")
print("="*60)
bo_bests = []
rs_bests = []
rng = np.random.default_rng(42)
for trial in range(n_trials):
seed = 42 + trial
# Bayesian Optimisation
bo = BayesianOptimiser(
objective=branin,
bounds=bounds,
acquisition="ei",
random_seed=seed,
)
_, best_bo = bo.optimize(n_iter=n_iter, n_init=n_init, verbose=False)
bo_bests.append(best_bo)
# Random Search
rs_rng = np.random.default_rng(seed)
lows = np.array([b[0] for b in bounds])
highs = np.array([b[1] for b in bounds])
rs_samples = rs_rng.uniform(lows, highs, size=(n_iter, len(bounds)))
rs_values = [branin(x) for x in rs_samples]
rs_bests.append(max(rs_values))
print(f"\nRandom Search - best found: {np.mean(rs_bests):.4f} ± {np.std(rs_bests):.4f}")
print(f"Bayesian Opt - best found: {np.mean(bo_bests):.4f} ± {np.std(bo_bests):.4f}")
print(f"BO improvement: {np.mean(bo_bests) - np.mean(rs_bests):.4f}")
print(f"(Branin global optimum: -0.397)")
# ─── BOTORCH / OPTUNA INTEGRATION ─────────────────────────────────────────────
OPTUNA_EXAMPLE = '''
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
def objective(trial):
"""
Optuna objective: suggest hyperparameters, train model, return validation loss.
Optuna uses Tree Parzen Estimator (TPE) by default - a BO variant.
"""
import torch
import torch.nn as nn
# Suggest hyperparameters
lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
dropout = trial.suggest_float("dropout", 0.0, 0.5)
batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
n_layers = trial.suggest_int("n_layers", 1, 4)
hidden_size = trial.suggest_int("hidden_size", 32, 512, step=32)
# Build model
layers = []
in_size = 784 # e.g., MNIST
for _ in range(n_layers):
layers += [nn.Linear(in_size, hidden_size), nn.ReLU(), nn.Dropout(dropout)]
in_size = hidden_size
layers.append(nn.Linear(in_size, 10))
model = nn.Sequential(*layers)
# Train (abbreviated - in practice, use full training loop)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
# ... training loop ...
# Return validation accuracy (Optuna maximizes by default with direction="maximize")
val_accuracy = 0.85 # placeholder
return val_accuracy
# Run optimization with 100 trials
study = optuna.create_study(
direction="maximize",
sampler=optuna.samplers.TPESampler(seed=42),
pruner=optuna.pruners.HyperbandPruner(), # early stopping of bad trials
)
study.optimize(objective, n_trials=100, n_jobs=4) # parallel on 4 CPUs
print(f"Best trial: {study.best_trial.params}")
print(f"Best value: {study.best_value:.4f}")
# Optuna dashboard:
# optuna-dashboard sqlite:///optuna.db
'''
BOTORCH_EXAMPLE = '''
# BoTorch: production Bayesian Optimisation from Meta
# pip install botorch
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood
def run_botorch_bo(objective_fn, bounds, n_init=5, n_iter=50):
"""
BoTorch-based BO with SingleTaskGP and Expected Improvement.
BoTorch handles:
- GP fitting with exact MLL optimization
- Batch acquisition (q-EI) for parallel evaluations
- GPU acceleration (move tensors to CUDA)
"""
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Normalize bounds to [0, 1]
bounds_tensor = torch.tensor(bounds, dtype=torch.float64, device=device).T
# Initialize with random points
train_X = torch.rand(n_init, len(bounds), dtype=torch.float64, device=device)
train_Y = torch.tensor(
[[objective_fn(x.numpy())] for x in train_X],
dtype=torch.float64, device=device
)
for iteration in range(n_iter):
# Fit GP
gp = SingleTaskGP(train_X, train_Y)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_mll(mll)
# Build EI acquisition
best_f = train_Y.max().item()
EI = ExpectedImprovement(model=gp, best_f=best_f, maximize=True)
# Optimize EI over unit cube (we normalized bounds)
unit_bounds = torch.stack([
torch.zeros(len(bounds), dtype=torch.float64),
torch.ones(len(bounds), dtype=torch.float64)
]).to(device)
candidate, acq_value = optimize_acqf(
EI,
bounds=unit_bounds,
q=1, # q=1 for sequential BO; q>1 for batch BO
num_restarts=10,
raw_samples=100,
)
# Evaluate
y_new = torch.tensor(
[[objective_fn(candidate[0].detach().numpy())]],
dtype=torch.float64, device=device
)
# Update dataset
train_X = torch.cat([train_X, candidate], dim=0)
train_Y = torch.cat([train_Y, y_new], dim=0)
if (iteration + 1) % 10 == 0:
print(f"Iter {n_init + iteration + 1}: best_y={train_Y.max().item():.4f}")
best_idx = train_Y.argmax().item()
return train_X[best_idx].detach().numpy(), train_Y.max().item()
'''
# ─── MULTI-FIDELITY BO (BOHB) ─────────────────────────────────────────────────
BOHB_DESCRIPTION = """
BOHB: Bayesian Optimization with HyperBand (Falkner et al. 2018)
Multi-fidelity BO uses cheap, low-fidelity evaluations (train for 1 epoch instead of 100)
to quickly filter out bad configurations, then uses expensive high-fidelity evaluations
only for the promising candidates.
BOHB combines:
1. HyperBand: successive halving schedule - start with many cheap configs,
progressively eliminate bad ones, train survivors longer
2. Bayesian Optimisation: use a KDE-based model (not GP) to propose new configs
that look promising based on high-performing observations
Algorithm sketch:
- Round 1: Evaluate 81 configs for 1 epoch each. Keep top 27.
- Round 2: Train those 27 for 3 epochs. Keep top 9.
- Round 3: Train those 9 for 9 epochs. Keep top 3.
- Round 4: Train those 3 for 27 epochs. Keep top 1.
- Round 5: Train final winner for 81 epochs.
Total compute: 81*1 + 27*3 + 9*9 + 3*27 + 1*81 = 81+81+81+81+81 = 405 epoch-equivalents
vs. random search at full fidelity: 1 config × 81 epochs = 81 epoch-equiv per config,
so same budget as only 5 random search trials!
Use BOHB via SMAC3 or Ray Tune:
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB
search_alg = TuneBOHB()
scheduler = HyperBandForBOHB(
time_attr="training_iteration",
max_t=81,
reduction_factor=3,
)
tuner = tune.Tuner(
train_fn,
tune_config=tune.TuneConfig(
search_alg=search_alg,
scheduler=scheduler,
num_samples=50,
),
)
results = tuner.fit()
"""
def run_demo():
"""Run the comparison demo."""
compare_methods(n_trials=3)
# Also show a simple 1D BO illustration
print("\n--- 1D BO Illustration ---")
def simple_1d(x):
"""Multi-modal 1D function: f(x) = sin(3x) + x^2 - 0.7*sin(x)"""
return float(np.sin(3*x[0]) + x[0]**2 - 0.7*np.sin(x[0]))
bo_1d = BayesianOptimiser(
objective=lambda x: -simple_1d(x), # maximize the negative
bounds=[(-1.5, 1.5)],
acquisition="ei",
random_seed=42,
)
best_x, best_y = bo_1d.optimize(n_iter=20, n_init=3, verbose=True)
print(f"\n1D result: best_x={best_x[0]:.4f}, best_y={-best_y:.4f}")
if __name__ == "__main__":
run_demo()
Batch BO: Evaluating Multiple Points in Parallel
Sequential BO (one evaluation at a time) is inefficient when you have multiple GPUs or compute nodes available. Batch BO extends BO to select points at each iteration, all evaluated in parallel.
q-Expected Improvement (q-EI): The joint expected improvement of evaluating points simultaneously:
This integral has no closed form for , but BoTorch approximates it via Monte Carlo sampling using the reparameterization trick, making it differentiable and optimizable via gradient ascent.
Fantasy-based approaches: Another strategy is "fantasizing" - for pending evaluations, sample the GP posterior to generate a "fantasy" observation, condition the GP on that, and optimize the next acquisition. Repeat times to get next points. Each point is chosen accounting for the information provided by the previous selections.
In practice, batch sizes of – are common for neural network hyperparameter optimization, corresponding to training multiple model variants in parallel on a cluster.
Trust Region BO (TuRBO) for High Dimensions
Standard GP-based BO struggles in high-dimensional spaces () because:
- GP kernel matrices become near-singular with high-dimensional inputs
- The hyperparameter space grows exponentially, making acquisition optimization harder
- Many dimensions are irrelevant (low effective dimensionality)
TuRBO (Eriksson et al. 2019) addresses this with local search. Instead of a single global GP, TuRBO maintains a set of trust regions - small boxes in hyperparameter space where the GP is a reliable local model.
TuRBO algorithm:
- Initialize one trust region centered at the best observed point, with side length
- Fit a GP within the trust region
- Optimize Thompson Sampling acquisition within the trust region
- Expand the trust region if consecutive successes; shrink on consecutive failures
- Restart with new trust region if it becomes too small
TuRBO achieves state-of-the-art performance on problems with – hyperparameters, where global GP-based BO fails.
Platform Comparison
| Platform | Surrogate | Acquisition | Multi-Fidelity | GPU | Best For |
|---|---|---|---|---|---|
| Google Vizier | GP (exact + sparse) | EI, UCB, Thompson | Yes (ASHA) | Yes | Large-scale production tuning at Google |
| Optuna | TPE (Tree Parzen Estimator) | EI-like | Yes (Hyperband) | No | General purpose, easy API, fast |
| BoTorch | GP (exact, batched) | q-EI, q-UCB, TS | Via custom | Yes | Research, custom acquisition functions |
| Ray Tune | Multiple (BoTorch, Optuna) | Configurable | Yes (ASHA, BOHB) | Yes | Distributed, multi-GPU clusters |
| SMAC3 | Random Forest surrogate | EI | Yes (BOHB) | No | NAS, large categorical spaces |
| Ax (Meta) | GP (BoTorch backend) | EI, qNEI | Yes | Yes | Production A/B, ML experiments at Meta |
Recommendation: Use Optuna for most ML hyperparameter tuning - it is fast, easy to integrate with any framework, and supports Hyperband pruning out of the box. Use BoTorch for research problems requiring custom acquisition functions or batch BO. Use Ray Tune for distributed tuning across a GPU cluster.
BO for Neural Architecture Search (NAS)
NAS (Neural Architecture Search) is the problem of finding the best neural network architecture for a task. The search space includes discrete choices: number of layers, layer types (conv, attention, MLP), kernel sizes, skip connections.
DARTS (Differentiable Architecture Search): Relaxes discrete architecture choices to continuous weights, making the search end-to-end differentiable. Optimizes architecture and weights jointly via gradient descent. Fast but prone to degeneracy (collapsing to skip-connection-dominated architectures).
BO-based NAS: Treat architecture as a discrete hyperparameter space and apply BO. The surrogate must handle discrete inputs - random forest surrogates (SMAC3) or GP with string kernels (designed for graph-structured architecture spaces). BANANAS (White et al. 2021) uses a neural network predictor as the surrogate for NAS, achieving competitive results.
In practice: For most ML practitioners, BO is used for training hyperparameters (learning rate, regularization, batch size), not architecture search. NAS is an expensive research-level tool; BoTorch or Optuna is the practical approach for hyperparameter tuning.
Limitations of Bayesian Optimisation
Dimensionality: GP-based BO works well for hyperparameters. Above 20, the GP kernel becomes unreliable (curse of dimensionality), and acquisition optimization becomes very hard. Use TuRBO for , or embed the high-dimensional space into a low-dimensional subspace (REMBO - Random Embedding BO).
GP complexity: Exact GP inference costs for observations. After 500 evaluations, this becomes a bottleneck. Solutions: sparse GP (inducing points), batched GP on GPU (BoTorch), or switch to random forest surrogates (SMAC3) which scale linearly.
Warm-starting: When you change the training dataset or model architecture, all previous observations are potentially invalid (different objective function). BO cannot warm-start across objective function changes without transfer learning extensions.
Noisy objectives: If training is stochastic and the same hyperparameters give different validation accuracy each time (due to random initialization), the GP model must account for observation noise. Use and consider running multiple repetitions of promising configurations.
Comparison Table: Hyperparameter Search Methods
| Method | Evaluations to Good Config | Parallelizable | Needs Warm-start | Best For |
|---|---|---|---|---|
| Grid search | Yes | No | 1–2 hyperparameters, small grids | |
| Random search | 50–200 | Yes | No | When few hyperparameters matter |
| Hyperband (ASHA) | 20–50 | Yes | No | Fast iteration with cheap early stopping |
| Bayesian Optimisation | 50–150 | Batch BO | Transfer BO | d ≤ 20, expensive evaluations |
| BOHB | 20–100 | Yes | No | Best of BO + Hyperband, most practical |
Common Mistakes
:::danger Mistake 1: Using Bayesian Optimisation when random search is sufficient BO has overhead: GP fitting is , acquisition optimization requires multiple restarts, and the initialization phase requires sequential waiting. If your model trains in 30 seconds, random search with 200 parallel evaluations on a cluster is faster and cheaper than BO. BO beats random search only when evaluations are expensive (> 5 minutes each) and you have a limited budget (< 200 evaluations). For cheap evaluations, random search with parallel execution is the right tool. :::
:::danger Mistake 2: Not scaling hyperparameters before GP fitting
GP kernels measure distance in the input space. If learning rate ranges from to and dropout ranges from 0 to 0.5, the raw distance is dominated by the dropout term. Always log-transform bounded-positive hyperparameters (learning rate, weight decay) before BO, and normalize all hyperparameters to [0, 1] for GP fitting. In Optuna and BoTorch, use log=True for learning rates and use the bounded search space to handle normalization.
:::
:::warning Mistake 3: Treating the best observed hyperparameter as the global optimum BO finds a near-optimal configuration within the budget, but it is not guaranteed to find the global optimum. After BO, re-evaluate the best configuration 3–5 times (with different random seeds) to estimate variance in the result. The configuration that looked best in one training run may have benefited from a lucky random seed. Report mean ± std over multiple runs, not the single best observation. :::
:::warning Mistake 4: Ignoring hyperparameter interactions when setting bounds
BO works within the bounds you specify. If the optimal learning rate depends on batch size (which it does - effective learning rate scales with batch size), setting learning rate bounds independently of batch size may never visit the optimal (lr, batch_size) combination. Define the search space thoughtfully: either fix reasonable bounds that cover the interaction space, or add a derived hyperparameter (e.g., effective_lr = lr * sqrt(batch_size)). Alternatively, use BOHB's warm-starting from lower fidelity evaluations to explore the joint space more efficiently.
:::
YouTube Resources
| Resource | Creator | Focus |
|---|---|---|
| Bayesian Optimisation - A Tutorial | Peter Frazier (Cornell) | Theory: GP surrogate, EI derivation |
| Hyperparameter Tuning with Optuna | Pfeiffer Optuna | Practical Optuna walkthrough |
| BoTorch: Bayesian Optimisation in PyTorch | Meta AI | BoTorch library, batch BO, GPU acceleration |
| BOHB: Hyperband + Bayesian Optimisation | Frank Hutter | BOHB paper explanation and comparison |
| Neural Architecture Search with BO | Colin White | BANANAS: NAS with BO surrogate |
Interview Q&A
Q1: Derive the Expected Improvement acquisition function. What are the two components and what do they represent?
EI is defined as where is the current best observed value. Under the GP posterior, . Let . Then , and the expectation of the positive part is: . The first term is the exploitation term: it is large when the predicted mean exceeds the current best and when the probability of improvement is high. The second term is the exploration term: it is large when uncertainty is high, even if the mean is below the current best. EI automatically balances exploration and exploitation without a tuning parameter.
Q2: When does Bayesian Optimisation beat random search? Give a concrete example with numbers.
Bergstra and Bengio (2012) showed theoretically and empirically that random search beats grid search because hyperparameters vary in importance - if 4 of 10 hyperparameters matter, random search explores the 4-dimensional relevant subspace densely while grid search wastes evaluations on the 6 irrelevant dimensions. BO beats random search when: (1) evaluations are expensive (>5 minutes each), (2) the objective is smooth (GP captures the structure), and (3) budget is limited (<200 evaluations). Concrete example: a neural network with 8 hyperparameters, each evaluation takes 2 hours on a GPU. Budget: 100 evaluations = 200 GPU-hours. Random search picks 100 random configurations independently. BO, after 10 random initializations, observes that high learning rate + high dropout consistently fails, and high weight decay + small batch size performs well. BO concentrates the remaining 90 evaluations near the discovered optimum region. In practice on standard benchmarks, BO finds configurations with 1–3% higher validation accuracy than random search at the same budget, when the objective is smooth and .
Q3: What is multi-fidelity BO and when should you use BOHB over standard BO?
Multi-fidelity BO uses cheap, low-fidelity evaluations (train for 5 epochs) to screen out clearly bad configurations, reserving expensive high-fidelity evaluations (train for 100 epochs) for promising candidates. The key insight from HyperBand: early performance (5 epochs) is correlated with final performance (100 epochs) - not perfectly, but enough to eliminate the bottom 75% of configurations early. BOHB combines HyperBand's successive halving schedule with a BO-based configuration proposer (KDE model). Use BOHB over standard BO when: (1) you can train partial models (early stopping is supported), (2) the budget allows at least 50 evaluations, (3) you expect many clearly bad configurations (learning rates that are too high → immediate divergence). Standard BO is better when: (1) evaluations cannot be stopped early (e.g., full training is required before any signal), (2) (BOHB also struggles at high dimensions, but no worse than BO), (3) you need to integrate custom acquisition functions (BoTorch). In practice, BOHB (via Ray Tune) is the most efficient practical method for neural network hyperparameter tuning: it reduces the effective number of expensive evaluations by 3–10x compared to standard BO.
Q4: How do you handle parallel evaluation in BO (batch BO)? What is the fantasization approach?
Standard BO is sequential: evaluate one point, update the GP, choose the next. But if you have 8 GPUs, you want to evaluate 8 hyperparameter configurations simultaneously. Batch BO selects configurations at each iteration. q-EI (the canonical batch acquisition) computes the joint expected improvement of evaluating all points simultaneously: . This has no closed form for , but BoTorch approximates it via Monte Carlo (reparameterization trick) making it differentiable and optimizable with gradient ascent. The fantasization approach is simpler to implement: to select points, select by standard EI, then "fantasize" (sample) the GP posterior at to get a fantasy observation , condition the GP on , and select from the updated GP. Repeat for . This is computationally cheaper than q-EI but relies on the quality of GP posterior samples. For most applications, batch sizes of – match the number of available GPUs, and Thompson Sampling (sample GP, optimize sample, repeat times) provides a simple, naturally parallel approach.
Q5: A colleague says "Bayesian Optimisation doesn't scale above 20 dimensions - just use random search for large hyperparameter spaces." How do you respond?
This is mostly correct but misses two important tools. The colleague is right that standard GP-based BO struggles above : GP kernels assume a specific smoothness structure that becomes unreliable in high-dimensional spaces, acquisition optimization becomes computationally expensive, and the curse of dimensionality means the GP needs exponentially more data to model the function well. However, two extensions address this: (1) TuRBO (Trust Region BO, Eriksson 2019) uses local GP models within small trust regions around the current best point. It does not try to model the global function; instead, it does efficient local search that restarts when stuck. TuRBO achieves state-of-the-art results on 40–100 dimensional problems. (2) REMBO and ALEBO use random linear projections to embed the high-dimensional space into a low-dimensional subspace, run BO in that subspace, and project back. This works well when the objective has low effective dimensionality (only a few of the hyperparameters truly matter). In practice: for with expensive evaluations, try TuRBO. For moderate (10–30) with quick evaluation, Optuna's TPE (Tree Parzen Estimator) with Hyperband pruning is often the best practical option - it implicitly handles high-dimensional spaces by modeling the distribution of good vs bad configurations with kernel density estimators rather than a global GP.
Key Takeaways
Bayesian Optimisation solves the expensive black-box optimization problem by fitting a Gaussian Process surrogate to all previous evaluations and using an acquisition function to select the most promising next evaluation point. The GP provides posterior mean (prediction) and variance (uncertainty) at any unobserved point. Expected Improvement is the canonical acquisition function - it analytically integrates the probability and magnitude of improvement over the current best. BO beats random search when evaluations are expensive and the objective is smooth, typically finding better configurations in 50–150 evaluations versus 200+ for random search. The main limitations are computational scaling of GP () and the dimensionality barrier above - addressed by TuRBO for local search and BOHB for multi-fidelity evaluation. For practical ML hyperparameter tuning, Optuna with Hyperband pruning is the recommended default. For research requiring custom acquisition functions or batch BO, use BoTorch.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Gaussian Process demo on the EngineersOfAI Playground - no code required.
:::
