Hyperparameter Optimization
Four Days, 200 Trials, Wrong Region
The deadline is Friday. Your model needs to hit 0.85 AUC on the holdout set to get approved for A/B testing. It is Monday. You have four days of GPU time budgeted.
Your junior engineer launches a grid search: learning rate in [1e-4, 1e-3, 1e-2], batch size in [32, 64, 128], dropout in [0.1, 0.2, 0.3, 0.4], and number of layers in [2, 4, 6]. That is 324 combinations. At 15 minutes per trial, it would take 81 hours - you only have 48. He subsets it to 200 trials, choosing the combinations "that seem most important."
Friday arrives. Best result: 0.831 AUC. You missed the threshold. You look at the trial results and notice something: all 200 trials used learning rates between 1e-4 and 1e-3. The optimal learning rate for this architecture is actually around 4e-4 - a value never explicitly tested, sitting in the gap between your grid points. The grid forced you to test large regions of bad space and never sample the good region densely enough.
Meanwhile, a competing team running Bayesian optimization with 50 trials found 0.863 AUC in 12 hours.
This lesson is about doing HPO correctly.
Why Hyperparameter Optimization Exists
The mathematical formulation is straightforward. Let $f(\lambda)$ be the validation metric as a function of the hyperparameter configuration $\lambda \in \Lambda$. HPO is the problem:
$$\lambda^* = \arg\max_{\lambda \in \Lambda} f(\lambda)$$
where evaluating $f(\lambda)$ involves training a model with configuration $\lambda$ and measuring its validation performance. The challenge: $f$ is expensive (minutes to days per evaluation), black-box (no gradient information), noisy (stochastic training), and high-dimensional (tens of hyperparameters).
The key insight driving modern HPO: not all hyperparameter configurations are equally promising, and we can learn which regions of the space are promising from previous trials without needing to evaluate the full space.
The Landscape of HPO Methods
Grid Search: When It Works and When It Fails
Grid search evaluates every point on a regular grid over the hyperparameter space. It is guaranteed to find the best point on the grid, which may still be far from the true optimum, so it is only useful when:
- You have very few hyperparameters (1–2)
- The important hyperparameters are discrete with few values
- You have unlimited compute budget
The critical failure mode: with $d$ hyperparameters each taking $k$ values, you need $k^d$ evaluations. With 5 hyperparameters and 5 values each: $5^5 = 3125$ trials. At 1 hour per trial, that is 130 days.
Worse: if only 2 of your 5 hyperparameters matter, that grid tests just $5^2 = 25$ distinct settings of the important pair. The other 3,100 trials, over 99% of the budget, only vary the irrelevant 3. Random search does not have this problem.
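To make the blow-up concrete, here is a small sketch that recomputes the numbers above and compares how many distinct values a single hyperparameter receives under a grid versus random sampling with the same budget. The helper name `grid_trials` and the rescaling of one axis to [0, 1] are illustrative assumptions, not part of any library.

```python
import random

def grid_trials(values_per_param: int, n_params: int) -> int:
    """Number of evaluations a full grid needs: k ** d."""
    return values_per_param ** n_params

n_trials = grid_trials(values_per_param=5, n_params=5)
days = n_trials * 1.0 / 24  # assuming 1 hour per trial
print(f"{n_trials} trials, about {days:.0f} days")

# Coverage of a single axis (say, a learning rate rescaled to [0, 1])
# under the same budget of n_trials evaluations.
grid_values = {i / 4 for i in range(5)}  # the grid reuses the same 5 points
rng = random.Random(0)
random_values = {rng.random() for _ in range(n_trials)}  # a fresh value per trial
print(f"distinct values on one axis: grid={len(grid_values)}, random={len(random_values)}")
```

The grid spends its whole budget revisiting five values per axis, while random sampling tries a new value of every hyperparameter on every trial.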
# When grid search is appropriate: small discrete spaces
from itertools import product
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# X_train and y_train are assumed to be defined earlier (binary labels for roc_auc)
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "kernel": ["rbf", "poly", "sigmoid"],
    "gamma": ["scale", "auto"],
}

results = []
for C, kernel, gamma in product(
    param_grid["C"], param_grid["kernel"], param_grid["gamma"]
):
    # Mean 5-fold cross-validated AUC for this grid point
    score = cross_val_score(
        SVC(C=C, kernel=kernel, gamma=gamma),
        X_train, y_train, cv=5, scoring="roc_auc"
    ).mean()
    results.append({"C": C, "kernel": kernel, "gamma": gamma, "auc": score})

best = max(results, key=lambda x: x["auc"])
print(f"Best: {best}")
Random Search: Better Than Grid for Almost Everything
Random search samples uniformly at random from the hyperparameter space. It sounds naive but outperforms grid search in practice because:
- It samples every hyperparameter independently. If 3 out of 6 hyperparameters are irrelevant, grid search still tiles all values of those 3. Random search does not - it gets more diverse coverage of the important ones.
- It covers continuous spaces naturally. A continuous learning rate from 1e-5 to 1e-1 has infinitely many values. Grid search picks 3-5 discrete points. Random search can hit any point in the range.
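The mathematical result: with $n$ independent trials, random search finds a configuration within the top fraction $p$ of the space with probability $1 - (1 - p)^n$. For the top 5% of configurations, 60 trials already give about a 95% chance. Grid search makes no such guarantee.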
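The guarantee for random search follows from treating each trial as an independent draw: if a fraction p of the space counts as "good", the chance that n draws all miss it is (1 - p)^n. The sketch below evaluates that expression. The function names are illustrative, and the independence assumption holds for pure random search but not for adaptive samplers.

```python
import math

def prob_hit_top(n_trials: int, top_fraction: float) -> float:
    """Probability that at least one of n random trials lands in the top fraction."""
    return 1.0 - (1.0 - top_fraction) ** n_trials

def trials_needed(top_fraction: float, confidence: float) -> int:
    """Smallest n such that prob_hit_top(n, top_fraction) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))

for n in (10, 30, 60, 100):
    print(f"n={n:3d}: P(hit top 5%) = {prob_hit_top(n, 0.05):.3f}")

print("trials for 95% confidence of hitting the top 5%:", trials_needed(0.05, 0.95))
```

Note that the bound does not depend on the number of hyperparameters, only on how large the "good" region is relative to the whole space.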
import random
from typing import Optional

import numpy as np

def sample_random_config(search_space: dict, seed: Optional[int] = None) -> dict:
    """Sample a random hyperparameter configuration."""
    if seed is not None:
        random.seed(seed)
        np.random.seed(seed)
    config = {}
    for param, spec in search_space.items():
        if spec["type"] == "float":
            if spec.get("log", False):
                # Sample uniformly in log space, then map back
                config[param] = np.exp(
                    np.random.uniform(np.log(spec["min"]), np.log(spec["max"]))
                )
            else:
                config[param] = np.random.uniform(spec["min"], spec["max"])
        elif spec["type"] == "int":
            # randint excludes the upper bound, so add 1 to make "max" inclusive
            config[param] = np.random.randint(spec["min"], spec["max"] + 1)
        elif spec["type"] == "categorical":
            config[param] = random.choice(spec["values"])
    return config

search_space = {
    "learning_rate": {"type": "float", "min": 1e-5, "max": 1e-1, "log": True},
    "batch_size": {"type": "categorical", "values": [32, 64, 128, 256]},
    "dropout": {"type": "float", "min": 0.0, "max": 0.5},
    "num_layers": {"type": "int", "min": 2, "max": 12},
    "d_model": {"type": "categorical", "values": [64, 128, 256, 512]},
}

# Generate 100 random configs and evaluate
# (train_and_evaluate is assumed to be your own training function)
for trial in range(100):
    config = sample_random_config(search_space, seed=trial)
    val_auc = train_and_evaluate(**config)
    print(f"Trial {trial}: AUC={val_auc:.4f} | LR={config['learning_rate']:.2e}")
Optuna: Production-Grade HPO
Optuna is one of the most practical HPO libraries for Python ML teams. It is framework-agnostic, ships samplers for the major search algorithms (random, TPE, CMA-ES, and others), integrates with MLflow and W&B through callbacks, and has solid support for parallel execution and pruning.
Core Optuna Concepts
- Study: the optimization session - one study per HPO task
- Trial: one evaluation of the objective function - trains one model
- Objective function: takes a Trial object, suggests hyperparameters, trains, returns metric
- Sampler: the search algorithm (TPE, random, CMA-ES, etc.)
- Pruner: decides whether to stop a trial early based on intermediate results
Basic Optuna Usage
import optuna
# Recent Optuna releases ship integrations in the separate optuna-integration package
from optuna.integration.mlflow import MLflowCallback
import mlflow

optuna.logging.set_verbosity(optuna.logging.WARNING)

# MAX_EPOCHS, build_model, build_optimizer, train_epoch, evaluate, train_loader,
# and val_loader are assumed to be defined elsewhere in your project.

def objective(trial: optuna.Trial) -> float:
    """Objective function: train one model, return validation AUC."""
    # Suggest hyperparameters
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    num_layers = trial.suggest_int("num_layers", 2, 12)
    d_model = trial.suggest_categorical("d_model", [64, 128, 256, 512])
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-1, log=True)
    scheduler = trial.suggest_categorical(
        "scheduler", ["cosine", "step", "linear", "constant"]
    )
    # batch_size and scheduler are suggested here; wire them into your
    # data loader and LR scheduler setup.

    # Build and train model
    model = build_model(num_layers=num_layers, d_model=d_model, dropout=dropout)
    optimizer = build_optimizer(model, lr=lr, weight_decay=weight_decay)

    best_val_auc = 0.0
    for epoch in range(MAX_EPOCHS):
        train_epoch(model, train_loader, optimizer)
        val_auc = evaluate(model, val_loader)["auc"]

        # Report intermediate value to pruner
        trial.report(val_auc, step=epoch)

        # Prune if unpromising
        if trial.should_prune():
            raise optuna.TrialPruned()

        best_val_auc = max(best_val_auc, val_auc)
    return best_val_auc

# Create study with Bayesian sampler (TPE) and Hyperband pruner
study = optuna.create_study(
    study_name="ctr_model_hpo",
    direction="maximize",
    sampler=optuna.samplers.TPESampler(
        seed=42,
        n_startup_trials=10,  # use random search for first 10 trials
        multivariate=True,    # model parameter correlations
    ),
    pruner=optuna.pruners.HyperbandPruner(
        min_resource=3,       # minimum epochs before pruning
        max_resource=MAX_EPOCHS,
        reduction_factor=3,
    ),
    storage="sqlite:///hpo_study.db",  # persistent storage - can resume
    load_if_exists=True,               # resume if study already exists
)

# Attach MLflow callback to log each trial as a run.
# Set the tracking URI first so the experiment lookup hits the same server;
# the "hpo_sweep" experiment must already exist.
TRACKING_URI = "http://mlflow.internal:5000"
mlflow.set_tracking_uri(TRACKING_URI)
mlflow_callback = MLflowCallback(
    tracking_uri=TRACKING_URI,
    metric_name="val_auc",
    create_experiment=False,
    mlflow_kwargs={
        "experiment_id": mlflow.get_experiment_by_name("hpo_sweep").experiment_id,
    },
)

# Run optimization - can be called multiple times to add more trials
study.optimize(
    objective,
    n_trials=200,
    n_jobs=4,  # parallel trials in threads (if objective is thread-safe)
    callbacks=[mlflow_callback],
    gc_after_trial=True,  # run garbage collection between trials
)

# Analyze results
print(f"Best trial: #{study.best_trial.number}")
print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# Visualize parameter importance (write_image needs plotly and kaleido installed)
fig = optuna.visualization.plot_param_importances(study)
fig.write_image("param_importances.png")
TPE: Tree-structured Parzen Estimator
TPE is Optuna's default sampler and the most practical Bayesian method for ML HPO. Understanding how it works helps you use it better.
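TPE models the search problem as:
$$p(\lambda \mid y) = \begin{cases} \ell(\lambda) & \text{if } y \ge y^* \\ g(\lambda) & \text{if } y < y^* \end{cases}$$
where $\ell(\lambda)$ is the probability density of good configurations (those with metric above threshold $y^*$) and $g(\lambda)$ is the density of bad configurations. The next trial is the configuration that maximizes $\ell(\lambda) / g(\lambda)$ - the configuration most likely to be good relative to bad.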
In practice: TPE fits kernel density estimators on the observed good and bad configurations, draws candidates from the good density, and picks the candidate with the highest good-to-bad density ratio. The n_startup_trials parameter controls how many random trials to run before switching to TPE - you need enough observations to fit meaningful densities.
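The idea can be sketched in a few lines for a single continuous hyperparameter. This is a toy illustration under stated assumptions, not Optuna's implementation: the fixed bandwidth, the `gamma` quantile, the candidate count, and `toy_objective` (a stand-in for a real training run) are all hypothetical choices made for the example.

```python
import math
import random

def kde_pdf(x, points, bandwidth):
    """Gaussian kernel density estimate at x, built from the given points."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

def tpe_suggest(history, low, high, rng, gamma=0.25, n_candidates=24, bandwidth=0.1):
    """Suggest the next x from (x, score) history; higher scores are better."""
    ranked = sorted(history, key=lambda t: t[1], reverse=True)
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]   # top gamma fraction of trials
    bad = [x for x, _ in ranked[n_good:]]    # everything else
    if not bad:
        return rng.uniform(low, high)        # not enough data: fall back to random
    # Draw candidates from the "good" density: pick a good point, add kernel noise.
    candidates = [
        min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
        for _ in range(n_candidates)
    ]
    # Keep the candidate with the highest good-to-bad density ratio.
    return max(
        candidates,
        key=lambda c: kde_pdf(c, good, bandwidth) / (kde_pdf(c, bad, bandwidth) + 1e-12),
    )

def toy_objective(x):
    """Hypothetical stand-in for a training run; peaks at x = 0.3."""
    return -(x - 0.3) ** 2

rng = random.Random(0)
# Startup phase: 10 purely random trials, then 30 TPE-guided trials.
history = [(x, toy_objective(x)) for x in (rng.uniform(0, 1) for _ in range(10))]
for _ in range(30):
    x = tpe_suggest(history, 0.0, 1.0, rng)
    history.append((x, toy_objective(x)))

best_x, best_score = max(history, key=lambda t: t[1])
print(f"best x = {best_x:.3f}, score = {best_score:.5f}")
```

The startup loop plays the role of n_startup_trials: with too few random points, both density estimates are built from almost nothing and the ratio is not informative.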
Parallelizing HPO with Optuna
# Multiple worker processes on one machine, sharing one storage backend
import multiprocessing

import optuna

STORAGE = "postgresql://optuna:password@db:5432/optuna"

# objective is assumed to be defined at module level (see the basic usage example)

def run_worker(study_name: str, n_trials: int, worker_id: int):
    # The sampler is not persisted in storage, so each worker passes its own.
    # Distinct seeds keep workers from proposing identical configurations.
    study = optuna.load_study(
        study_name=study_name,
        storage=STORAGE,
        sampler=optuna.samplers.TPESampler(seed=worker_id),
    )
    study.optimize(objective, n_trials=n_trials)

if __name__ == "__main__":
    # Create study once
    optuna.create_study(
        study_name="parallel_hpo",
        storage=STORAGE,
        direction="maximize",
        load_if_exists=True,
    )

    # Launch 4 parallel workers, 50 trials each
    processes = []
    for worker_id in range(4):
        p = multiprocessing.Process(
            target=run_worker,
            args=("parallel_hpo", 50, worker_id)
        )
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
Hyperband and ASHA: Multi-Fidelity HPO
The core insight of multi-fidelity methods: you do not need to train every trial to convergence. A configuration that is terrible at epoch 5 is almost certainly terrible at epoch 100. Train cheap, discard the bad ones early, train the survivors longer.
Hyperband Algorithm
Hyperband runs multiple brackets, each a "successive halving" schedule:
- Start with $n$ configurations, training each for $r$ resources (epochs/samples)
- Keep the top $n/\eta$ configurations (prune the rest)
- Train survivors for $\eta r$ resources
- Repeat until the survivors have been trained for the maximum of $R$ resources
With $n = 81$, $r = 1$, $\eta = 3$: start 81 trials at 1 epoch, keep 27, run to 3 epochs, keep 9, run to 9 epochs, keep 3, run to 27 epochs, keep 1, run to 81 epochs. Total cost: at most 405 epochs across the five rungs, versus 6,561 epochs for running all 81 trials to 81 epochs - roughly 16x cheaper.
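Tallying one successive-halving bracket by hand is error-prone, so here is a short sketch that builds the schedule for 81 starting configurations, 1 starting epoch, and a reduction factor of 3, then counts epochs. The function name is illustrative. Two accounting assumptions are shown: "restart" retrains each survivor from scratch at every rung, while "resume" continues from a checkpoint and pays only for the additional epochs.

```python
def successive_halving_rungs(n: int, r: int, eta: int, max_resource: int):
    """Return the (num_configs, epochs_each) schedule of one bracket."""
    rungs = []
    while n >= 1 and r <= max_resource:
        rungs.append((n, r))
        n //= eta   # keep the top 1/eta configurations
        r *= eta    # train the survivors eta times longer
    return rungs

rungs = successive_halving_rungs(n=81, r=1, eta=3, max_resource=81)
print(rungs)

# Assumption 1: every rung restarts training from scratch.
restart_cost = sum(n * r for n, r in rungs)

# Assumption 2: survivors resume from checkpoints, paying only for new epochs.
resume_cost, previous_r = 0, 0
for n, r in rungs:
    resume_cost += n * (r - previous_r)
    previous_r = r

full_cost = 81 * 81  # all 81 configurations trained for 81 epochs
print(f"restart: {restart_cost} epochs, resume: {resume_cost} epochs, full: {full_cost} epochs")
```

Full Hyperband runs several such brackets with different starting budgets, so its total cost is higher than a single bracket; the tally here covers only the most aggressive one.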
ASHA: Asynchronous Successive Halving
Hyperband requires synchronization - all trials in a bracket must finish before promotions happen. ASHA removes synchronization: trials are promoted as soon as they finish and their performance can be compared to peers at the same budget. This is critical for distributed HPO where trials finish at different rates.
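The asynchronous decision can be sketched as a rule applied whenever a single trial reaches a rung: compare it only against the scores already recorded at that rung, and let it continue if it sits in the top 1/eta. The class below is a simplified, stopping-style illustration in the spirit of ASHA, not the exact algorithm from the paper or from any library; the class name and method names are hypothetical.

```python
from collections import defaultdict

class SimpleAsha:
    """Simplified asynchronous successive halving (higher scores are better)."""

    def __init__(self, eta: int = 3, max_resource: int = 81):
        self.eta = eta
        self.max_resource = max_resource
        self.rung_scores = defaultdict(list)  # rung resource -> scores seen so far

    def should_continue(self, rung_resource: int, score: float) -> bool:
        """Record a score at a rung and decide immediately, without waiting for peers."""
        scores = self.rung_scores[rung_resource]
        scores.append(score)
        if rung_resource >= self.max_resource:
            return False  # trial has used its full budget
        top_k = max(1, len(scores) // self.eta)
        cutoff = sorted(scores, reverse=True)[top_k - 1]
        return score >= cutoff

scheduler = SimpleAsha(eta=3)
# Six trials reach the 1-epoch rung at different times, in this order.
decisions = [scheduler.should_continue(1, s) for s in (0.5, 0.6, 0.4, 0.7, 0.3, 0.65)]
print(decisions)
```

Early arrivals are promoted optimistically because there is little to compare against; that is the price paid for never making a worker wait.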
import optuna

# Optuna's HyperbandPruner implements the successive halving logic.
# Combined with TPESampler, it gives a BOHB-style setup
# (Bayesian optimization plus Hyperband).
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(multivariate=True),
    pruner=optuna.pruners.HyperbandPruner(
        min_resource=5,      # minimum epochs before any pruning
        max_resource=100,    # maximum epochs (your MAX_EPOCHS)
        reduction_factor=3,  # keep top 1/3 at each promotion
    ),
)

# In your objective, report after every epoch and check should_prune()
# (train_one_epoch_and_eval is assumed to be your own training step)
def objective(trial):
    for epoch in range(100):
        val_metric = train_one_epoch_and_eval()
        trial.report(val_metric, step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return val_metric
Population Based Training
PBT is a different approach: maintain a population of models training in parallel, and periodically copy weights from the best performers to the worst performers while mutating their hyperparameters.
PBT finds non-stationary hyperparameter schedules - the learning rate and other hyperparameters change throughout training. This is something grid/random/Bayesian methods cannot find, because they treat hyperparameters as fixed for the entire run.
PBT is most effective for: large-scale deep RL, very long training runs (language model pretraining), situations where learning rate schedules matter more than final values.
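The exploit-and-explore step at the heart of PBT can be sketched without any framework. This is a minimal illustration under assumptions: population members are plain dicts with hypothetical keys `score`, `weights`, and `hparams`, the bottom 25% copy from the top 25%, and hyperparameters are perturbed by a factor of 0.8 or 1.2.

```python
import copy
import random

def pbt_step(population, rng, bottom_frac=0.25, perturb_factors=(0.8, 1.2)):
    """One exploit/explore step: worst members clone a top member, then mutate."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    n_cut = max(1, int(len(ranked) * bottom_frac))
    top, bottom = ranked[:n_cut], ranked[-n_cut:]
    for member in bottom:
        donor = rng.choice(top)
        # Exploit: copy the donor's weights.
        member["weights"] = copy.deepcopy(donor["weights"])
        # Explore: perturb each of the donor's hyperparameters.
        member["hparams"] = {
            name: value * rng.choice(perturb_factors)
            for name, value in donor["hparams"].items()
        }
    return population

rng = random.Random(0)
population = [
    {"score": float(i), "weights": [i], "hparams": {"lr": 1e-3 * (i + 1)}}
    for i in range(8)
]
pbt_step(population, rng)
worst = population[0]  # had the lowest score before the step
print(worst["weights"], worst["hparams"])
```

Because this step repeats every few epochs, each surviving model ends up with a sequence of hyperparameter values over time, which is how PBT arrives at a schedule rather than a single setting.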
# Ray Tune implements PBT cleanly
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="val_auc",
    mode="max",
    perturbation_interval=5,  # perturb every 5 epochs
    hyperparam_mutations={
        "learning_rate": tune.loguniform(1e-5, 1e-1),
        "dropout": tune.uniform(0.0, 0.5),
        "weight_decay": tune.loguniform(1e-6, 1e-1),
    },
    resample_probability=0.25,  # 25% chance of resampling from prior
)

# train_fn is assumed to be your own trainable; it must save and restore
# checkpoints so PBT can copy weights between population members.
tuner = tune.Tuner(
    train_fn,
    tune_config=tune.TuneConfig(
        scheduler=pbt,
        num_samples=16,  # population size
    ),
    param_space={
        "learning_rate": tune.loguniform(1e-5, 1e-1),
        "dropout": tune.uniform(0.0, 0.5),
        "weight_decay": tune.loguniform(1e-6, 1e-1),
    },
)
results = tuner.fit()
Multi-Objective Optimization
Real ML problems often have multiple objectives: maximize accuracy AND minimize latency, or maximize recall AND minimize false positive rate. Optuna supports multi-objective optimization natively.
import optuna

# build_model, train_and_evaluate, and measure_inference_latency are assumed
# to be your own helpers; model.parameters() assumes a PyTorch module.

def multi_objective(trial: optuna.Trial):
    num_layers = trial.suggest_int("num_layers", 1, 12)
    d_model = trial.suggest_categorical("d_model", [64, 128, 256, 512])
    model = build_model(num_layers=num_layers, d_model=d_model)
    val_auc = train_and_evaluate(model)

    # Measure inference latency
    latency_ms = measure_inference_latency(model, batch_size=32)

    # Count parameters (proxy for model size) and keep it with the trial
    n_params = sum(p.numel() for p in model.parameters())
    trial.set_user_attr("n_params", n_params)

    return val_auc, -latency_ms  # maximize AUC, minimize latency (negate)

study = optuna.create_study(
    directions=["maximize", "maximize"],  # both objectives are maximized
    sampler=optuna.samplers.NSGAIISampler(seed=42),  # NSGA-II for multi-objective
)
study.optimize(multi_objective, n_trials=200)

# Get Pareto front
pareto_trials = study.best_trials
for trial in pareto_trials:
    auc, neg_latency = trial.values
    print(f"AUC={auc:.4f}, Latency={-neg_latency:.1f}ms | {trial.params}")
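What "Pareto front" means can be shown with a few lines of plain Python. The sketch below is a minimal illustration, not Optuna's internal code: each candidate is an (AUC, negated latency) pair with both values maximized, and the numbers are made up for the example.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (AUC, -latency_ms) results; both objectives are maximized.
candidates = [
    (0.86, -12.0),
    (0.84, -5.0),
    (0.85, -15.0),  # dominated by (0.86, -12.0): worse AUC and slower
    (0.80, -4.0),
    (0.83, -6.0),   # dominated by (0.84, -5.0)
]
front = pareto_front(candidates)
for auc, neg_latency in front:
    print(f"AUC={auc:.2f}, latency={-neg_latency:.1f}ms")
```

Every point on the front is a legitimate answer; choosing among them is a product decision about how much latency a point of AUC is worth.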
HPO Best Practices for Production
1. Define your search space carefully. Log-uniform distributions for learning rates. Categorical for architecture choices. Do not include hyperparameters you are not willing to actually change.
2. Use a budget proportional to the number of hyperparameters. A common heuristic: run at least $60d$ trials where $d$ is the number of hyperparameters. For 5 hyperparameters, run at least 300 trials.
3. Decouple data loading from model building. If your data loading is slow, every trial pays that cost. Cache the dataset in memory or on fast NVMe before starting the sweep.
4. Fix the validation split across all trials. Use the same val set for every trial in a sweep. Shuffle with the same seed. Otherwise you are measuring noise in the data split, not the effect of hyperparameters.
5. Run the best configuration multiple times. The optimal configuration found by a sweep may have been lucky. Retrain the top 3–5 configurations with different random seeds and report the mean and standard deviation of validation metrics.
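Practice 5 can be sketched as a small loop over seeds. Everything here is a stand-in: `noisy_train_and_evaluate` simulates a training run by adding seed-dependent noise to a made-up `base_auc`, so swap in your real training function and configuration before relying on the numbers.

```python
import random
import statistics

def noisy_train_and_evaluate(config: dict, seed: int) -> float:
    """Hypothetical stand-in for a real training run; returns a noisy AUC."""
    rng = random.Random(seed)
    return config["base_auc"] + rng.gauss(0.0, 0.005)

def rerun_with_seeds(config: dict, seeds) -> tuple:
    """Retrain one configuration under several seeds; return (mean, std) of the metric."""
    scores = [noisy_train_and_evaluate(config, seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

# Made-up top configurations from a sweep, for illustration only.
top_configs = {
    "trial_17": {"base_auc": 0.851},
    "trial_42": {"base_auc": 0.849},
}
summary = {
    name: rerun_with_seeds(config, seeds=range(5))
    for name, config in top_configs.items()
}
for name, (mean_auc, std_auc) in summary.items():
    print(f"{name}: AUC = {mean_auc:.4f} +/- {std_auc:.4f}")
```

If the gap between two configurations is smaller than their seed-to-seed standard deviation, the sweep has not really distinguished them.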
Common Mistakes
:::danger Running a Grid Search on a Continuous Space
If your learning rate can be any value between 1e-5 and 1e-1, never grid-search it. Use a log-uniform distribution with random or Bayesian sampling. Grid points miss the actual optimal value almost every time.
:::
:::danger Using the Test Set During HPO
Every evaluation of the objective function that uses the test set leaks information about the test set into the hyperparameter choices. Run HPO on validation data only. The test set is touched exactly once: final evaluation of the chosen hyperparameters.
:::
:::warning Setting n_startup_trials Too Low
TPE needs enough observed trials to build a meaningful surrogate model. Setting n_startup_trials=5 on a 10-dimensional search space means TPE starts guiding after seeing only 5 random points - nowhere near enough. Use at least 10 * d startup trials where d is the number of hyperparameters.
:::
:::warning Not Using Pruning
Running every trial to completion even when it is clearly underperforming wastes 50–80% of your compute budget. Enable pruning (trial.report() + trial.should_prune()) for every HPO sweep that takes more than 5 minutes per trial.
:::
Interview Q&A
Q: When would you use random search over Bayesian optimization?
A: Random search is preferable when: (1) you have enough compute to run 100+ trials in parallel - Bayesian methods are sequential, random search parallelizes perfectly; (2) your search space is mostly categorical or discrete - Bayesian methods shine on continuous spaces; (3) your objective function is very noisy - Bayesian methods can be led astray by noisy observations; (4) training time is under 5 minutes - the overhead of building the surrogate model is not worth it for very cheap objectives. Bayesian optimization is better when trials are expensive (hours), the search space is continuous, and you have a sequential budget (cannot parallelize).
Q: What is the exploitation vs. exploration tradeoff in Bayesian HPO?
A: The surrogate model (GP or KDE) tells us the expected metric and uncertainty for each unsampled configuration. Exploration samples configurations where uncertainty is high - we learn about unknown regions. Exploitation samples configurations where the expected metric is high - we refine our best guesses. The acquisition function (Expected Improvement, UCB, etc.) controls this tradeoff. Aggressive exploitation finds the local optimum quickly but misses global optima. Aggressive exploration samples the whole space but wastes trials on bad regions. TPE's kernel density approach implicitly balances these by sampling from configurations more likely to exceed the current best threshold.
Q: How does Hyperband achieve better sample efficiency than random search?
A: By early-stopping bad trials. In random search, every trial runs to completion regardless of how poorly it performs. Hyperband runs all trials cheaply (few epochs), then progressively filters out the worst performers at each promotion round. The resources saved from stopped trials are redirected to training the survivors longer. The key theoretical result (Li et al., 2017): Hyperband achieves the same result as the best fixed budget allocation in hindsight, within a log factor. In practice, this means 3x–10x fewer total GPU-hours to find the same quality optimum as random search.
Q: What is the difference between a hyperparameter and a model parameter?
A: Model parameters (weights, biases) are learned from data during training via gradient descent. Hyperparameters are set before training and control the training process - they are not updated by gradient descent. Examples: learning rate, batch size, number of layers, dropout rate, weight decay. The distinction matters for HPO: you cannot use backpropagation to optimize hyperparameters because the gradient of validation loss with respect to hyperparameters is generally intractable (or zero, for categorical choices). Instead you use black-box optimization: evaluate the validation metric for each hyperparameter configuration.
Q: How do you prevent overfitting to the validation set during a large HPO sweep?
A: Several strategies: (1) Use a held-out test set that is never touched during HPO - only used for final evaluation. (2) Use k-fold cross-validation as the objective instead of a single val split - more robust but k times more expensive. (3) Be conservative with the number of trials relative to the dataset size - 200 trials on a 1000-sample val set means you have effectively "seen" the val set 200 times through proxy. (4) Apply regularization via the hyperparameter choices themselves - sweeping over regularization strengths (dropout, weight decay) implicitly guards against overfitting. (5) Statistical testing: compare the winning configuration's val performance against its performance in multiple retrained runs.
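A quick simulation shows why the best validation score from a large sweep is optimistic. The setup is a deliberately artificial assumption: all 200 configurations have the same true AUC of 0.80, and each validation measurement adds Gaussian noise with standard deviation 0.01, so any apparent winner is pure selection effect.

```python
import random

rng = random.Random(0)
TRUE_AUC = 0.80      # assumed identical for every configuration
NOISE_STD = 0.01     # assumed validation noise from a finite val set
N_TRIALS = 200

# Validation score of each trial: the same true quality plus measurement noise.
val_scores = [TRUE_AUC + rng.gauss(0.0, NOISE_STD) for _ in range(N_TRIALS)]
best_val = max(val_scores)

# A fresh, independent evaluation of the "winner" (e.g. on the untouched test set).
test_score = TRUE_AUC + rng.gauss(0.0, NOISE_STD)

print(f"best validation AUC in sweep: {best_val:.4f}")
print(f"same config on fresh data:    {test_score:.4f}")
print(f"optimism from selection:      {best_val - TRUE_AUC:.4f}")
```

The more trials you run against the same validation set, the larger this gap tends to grow, which is why the final number should come from data the sweep never saw.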
