Hyperparameter optimization

Finally, the grid search algorithm outputs the settings that achieved the highest score in the validation procedure.

Grid search suffers from the curse of dimensionality, but is often embarrassingly parallel because the hyperparameter settings it evaluates are typically independent of each other.
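As an illustration, a grid search over two SVM hyperparameters can be written with scikit-learn's GridSearchCV (a minimal sketch; the library choice and grid values are assumptions, not part of the algorithm itself). Because the configurations are independent, n_jobs=-1 lets them be evaluated in parallel.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Cartesian grid: 5 x 4 = 20 configurations, each scored with 5-fold cross-validation.
param_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1e-3, 1e-2, 1e-1, 1],
}

search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)  # n_jobs=-1: evaluate in parallel
search.fit(X, y)

print(search.best_params_)   # the settings with the highest validation score
print(search.best_score_)
```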

Random search can outperform grid search, especially when only a small number of hyperparameters affects the final performance of the machine learning algorithm.

Despite its simplicity, random search remains one of the important baselines against which to compare the performance of new hyperparameter optimization methods.
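A corresponding random search can be sketched with RandomizedSearchCV, which samples each hyperparameter from a distribution instead of a fixed grid (again an illustrative sketch; the distributions and bounds are arbitrary).

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Each hyperparameter is drawn independently from a distribution rather than a grid.
param_distributions = {
    "C": loguniform(1e-1, 1e3),
    "gamma": loguniform(1e-4, 1e0),
}

search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=100, cv=5, random_state=0, n_jobs=-1
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```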

By iteratively evaluating a promising hyperparameter configuration based on the current model, and then updating it, Bayesian optimization aims to gather observations revealing as much information as possible about the objective function and, in particular, the location of its optimum.
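This loop of fitting a surrogate model to past observations and choosing the next configuration to evaluate can be sketched with the scikit-optimize package (an assumed dependency; the quadratic objective below is a toy stand-in for a real validation score).

```python
from skopt import gp_minimize
from skopt.space import Real

def objective(params):
    """Toy stand-in for training a model and returning a validation loss."""
    c, gamma = params
    return (c - 10.0) ** 2 / 100.0 + (gamma - 0.1) ** 2

# Search space: one dimension per hyperparameter.
space = [
    Real(1e-1, 1e2, prior="log-uniform"),   # C
    Real(1e-4, 1e0, prior="log-uniform"),   # gamma
]

# Each call fits a Gaussian-process surrogate to all previous
# (configuration, score) pairs and picks the next configuration
# by maximizing an acquisition function.
result = gp_minimize(objective, space, n_calls=30, random_state=0)
print(result.x, result.fun)
```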

Gradient-based optimization computes the gradient of a model-selection criterion with respect to the hyperparameters (the hypergradient) and uses it for gradient descent.[17][18][19][20] A more recent work along this direction uses the implicit function theorem to calculate hypergradients and proposes a stable approximation of the inverse Hessian.
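For an inner problem with a closed-form solution, such as ridge regression, the implicit-function-theorem hypergradient can be computed directly. The following sketch tunes the regularization strength on synthetic data (the data, step size, and names are illustrative); for large models the exact Hessian solve is replaced by the kind of stable approximation mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X_tr = rng.normal(size=(80, 5)); y_tr = X_tr @ w_true + rng.normal(size=80)
X_va = rng.normal(size=(20, 5)); y_va = X_va @ w_true + rng.normal(size=20)

def inner_solution(lam):
    """w*(lam) minimizing 0.5*||X_tr w - y_tr||^2 + 0.5*lam*||w||^2."""
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

def hypergradient(lam):
    """d L_va / d lam via the implicit function theorem:
    dw*/dlam = -H^{-1} w*, with H = X_tr^T X_tr + lam*I."""
    d = X_tr.shape[1]
    w = inner_solution(lam)
    dw_dlam = -np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), w)
    grad_w_val = X_va.T @ (X_va @ w - y_va)      # d L_va / d w at w*(lam)
    return float(grad_w_val @ dw_dlam)

# Gradient descent on log(lam) so the hyperparameter stays positive.
log_lam = 0.0
for _ in range(200):
    lam = np.exp(log_lam)
    log_lam -= 0.5 * hypergradient(lam) * lam    # chain rule for d/d log(lam)
print("tuned regularization strength:", np.exp(log_lam))
```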

Self-tuning networks[23] offer a memory-efficient version of this approach by choosing a compact representation for the hypernetwork.
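As a toy illustration of the general idea (not the architecture of the cited paper), a small hypernetwork can map the hyperparameter to a compact scale-and-shift modulation of a base layer's weights rather than predicting the full weight matrix; PyTorch is assumed here.

```python
import torch
import torch.nn as nn

class CompactHyperLayer(nn.Module):
    """Illustrative layer whose weights are modulated by a hyperparameter
    through a tiny hypernetwork, keeping the extra memory cost small."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        # hypernetwork: hyperparameter -> per-output-row scale and shift
        self.hyper = nn.Linear(1, 2 * out_dim)

    def forward(self, x, lam):
        scale, shift = self.hyper(lam.view(1, 1)).chunk(2, dim=-1)
        w = self.base.weight * (1 + scale.view(-1, 1))   # modulate rows of W
        b = self.base.bias + shift.view(-1)
        return x @ w.T + b

layer = CompactHyperLayer(4, 3)
out = layer(torch.randn(2, 4), torch.tensor(0.1))
print(out.shape)  # torch.Size([2, 3])
```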

Apart from hypernetwork approaches, gradient-based methods can also be used to optimize discrete hyperparameters by adopting a continuous relaxation of those parameters.
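For example, a discrete choice among candidate operations can be relaxed into a differentiable softmax-weighted mixture, with the discrete hyperparameter recovered by an argmax after optimization (an illustrative sketch; the candidates and names are arbitrary).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Discrete hyperparameter: which activation function to use. The relaxation mixes
# all candidates with softmax weights, so the choice becomes differentiable.
candidates = [np.tanh, lambda x: np.maximum(x, 0.0), lambda x: x]  # tanh, ReLU, identity
alpha = np.zeros(len(candidates))      # real-valued logits over the candidates

def relaxed_activation(x, alpha):
    w = softmax(alpha)
    return sum(wi * f(x) for wi, f in zip(w, candidates))

# After optimizing alpha jointly with the model weights (e.g. by gradient
# descent on a validation loss), the discrete choice is recovered by argmax.
x = np.linspace(-2, 2, 5)
print(relaxed_activation(x, alpha))
print("selected candidate:", int(np.argmax(alpha)))
```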

In contrast, non-adaptive methods use the sub-optimal strategy of assigning a constant set of hyperparameters for the whole training.

Irace implements the iterated racing algorithm, which focuses the search around the most promising configurations, using statistical tests to discard the ones that perform poorly.
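The racing idea can be sketched in a few lines: surviving configurations are evaluated on a stream of instances, and a configuration is discarded once a paired test judges it significantly worse than the current best. This is a simplified illustration, not irace's exact procedure, and evaluate is a hypothetical stand-in for running the target algorithm.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

def evaluate(config):
    """Hypothetical stand-in: run the target algorithm with `config` on the
    next problem instance and return a cost (lower is better)."""
    return (config["x"] - 0.3) ** 2 + rng.normal(scale=0.05)

# Candidate configurations. In irace these are sampled from a probability model
# that is iteratively refocused around the best survivors; here they are fixed
# to keep the sketch short.
configs = [{"x": x} for x in np.linspace(0.0, 1.0, 8)]
scores = [[] for _ in configs]
alive = list(range(len(configs)))

for instance in range(30):                       # race over a stream of instances
    for i in alive:
        scores[i].append(evaluate(configs[i]))
    if instance < 4:                             # collect a few paired samples first
        continue
    best = min(alive, key=lambda i: np.mean(scores[i]))
    survivors = [best]
    for i in alive:
        if i == best:
            continue
        # discard a configuration once it is statistically worse than the best
        _, p = ttest_rel(scores[i], scores[best])
        if p > 0.05:
            survivors.append(i)
    alive = sorted(survivors)

print("surviving configurations:", [configs[i] for i in alive])
```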

Asynchronous successive halving (ASHA)[34] further improves upon SHA's resource utilization profile by removing the need to synchronously evaluate and prune low-performing models.
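A minimal synchronous successive-halving loop, the baseline that ASHA makes asynchronous, can be sketched as follows; train_and_score is a hypothetical stand-in for training a configuration under a given resource budget.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_score(config, budget):
    """Hypothetical stand-in: train a model with `config` for `budget` epochs
    and return a validation score (higher is better)."""
    return 1.0 - config["error"] / np.log2(budget + 1) + rng.normal(scale=0.01)

# Eight random configurations, as in the figure below.
configs = [{"error": e} for e in rng.uniform(0.1, 1.0, size=8)]
budget, eta = 1, 2                          # initial budget and halving factor

while len(configs) > 1:
    scores = [train_and_score(c, budget) for c in configs]
    # keep the best half and give the survivors a doubled budget
    order = np.argsort(scores)[::-1]
    configs = [configs[i] for i in order[: max(1, len(configs) // eta)]]
    budget *= eta

print("winning configuration:", configs[0])
```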

Figure: Grid search across different values of two hyperparameters. For each hyperparameter, 10 different values are considered, so a total of 100 different combinations are evaluated and compared. Blue contours indicate regions with strong results, whereas red ones show regions with poor results.
Figure: Random search across different combinations of values for two hyperparameters. In this example, 100 different random choices are evaluated. The green bars show that more individual values for each hyperparameter are considered compared to a grid search.
Figure: Methods such as Bayesian optimization smartly explore the space of potential choices of hyperparameters by deciding which combination to explore next based on previous observations.
Figure: Successive halving for eight arbitrary hyperparameter configurations. The approach starts with eight models with different configurations and consecutively applies successive halving until only one model remains.