Model Performance Prediction for Hyperparameter Optimization of Deep Learning Models Using High Performance Computing and Quantum Annealing

Hyperparameter Optimization (HPO) of Deep Learning-based models tends to be a compute resource intensive process, as it usually requires training the target model with many different hyperparameter configurations. We show that integrating model performance prediction with early stopping methods holds great potential to speed up the HPO process of deep learning models. Moreover, we propose a novel algorithm called Swift-Hyperband that can use either classical or quantum support vector regression for performance prediction and benefit from distributed High Performance Computing environments. This algorithm is tested not only on the Machine-Learned Particle Flow model used in High Energy Physics, but also on a wider range of target models from domains such as computer vision and natural language processing. Swift-Hyperband is shown to find comparable (or better) hyperparameters while using fewer computational resources in all test cases.


Introduction
Training and Hyperparameter Optimization (HPO) of Deep Learning (DL) models is often compute resource intensive and calls for the use of large-scale High Performance Computing (HPC) resources as well as scalable and resource-efficient Hyperparameter (HP) search and evaluation algorithms [1]. Current state-of-the-art HPO algorithms such as Hyperband [2], ASHA [3], and BOHB [4] rely on early termination: badly performing trials are automatically terminated so that compute resources can be allocated to more promising ones. Such methods have been successfully applied to optimize Machine-Learned Particle Flow (MLPF), a particle flow reconstruction Neural Network (NN) [5]. Using this technique led to a reduction of ∼44% in the validation loss of MLPF [1].
In this context, performance prediction emerges as a potential approach to accelerate the HPO process. It involves using meta-models, referred to as performance predictors, that estimate the performance of a given configuration at a particular epoch by leveraging information from its partial learning curve. With performance prediction, it is possible to prioritize the training of the most promising configurations based on their predicted performance, while avoiding the need to fully train configurations with poorer predicted performance. Consequently, this approach holds great potential for reducing the time and computational resources required for the HPO process. This work explores novel techniques based on performance prediction to accelerate the HPO process of MLPF and other NN architectures, leveraging HPC resources for training the target model and quantum computing for training the performance predictors. Moreover, a new HPO algorithm, Swift-Hyperband, is proposed that integrates Hyperband with the use of model performance predictors.

Related work
Baker et al. [6] demonstrated that Support Vector Regression (SVR) models can effectively serve as performance predictors for various NN architectures. In addition to showing good predictive capability for performance prediction tasks, SVR models offer the advantage of having negligible training and inference times, even when using a consumer-grade laptop CPU. Hence, using SVRs prevents the training of performance predictors from becoming a bottleneck for the potential resource savings expected from this technique.
In the European Center of Excellence in Exascale Computing "Research on AI- and Simulation-based Engineering at Exascale" (CoE RAISE), the capability of SVRs to predict the loss of MLPF after 100 training epochs on the Delphes dataset [7] was successfully shown [8], achieving $R^2$ values of around 0.9 when using 25% of the target learning curve as input.
Here, $R^2$ is the so-called coefficient of determination, defined as

$$R^2 = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar{y})^2}, \qquad e_i = y_i - f_i,$$

where $y_i$ is the ground truth for data point $i$, $f_i$ is prediction $i$, $\bar{y}$ is the mean of all $y_i$, and $e_i$ is the error of prediction $i$.

Furthermore, the Quantum Annealer at the Jülich Supercomputing Centre was used to train Quantum Support Vector Regression (QSVR) [9] models on MLPF learning curves. No significant performance benefit was expected from the use of quantum resources, and employing quantum computers for performance prediction was primarily a proof of concept for integrating this technology into the HPO workflow. Nevertheless, the QSVRs achieved performance comparable to that obtained with classical SVRs [8]. Note that the number of training samples in this case had to be reduced to 20 due to limitations in problem size arising from the current state of quantum technologies.
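As a small illustration, the coefficient of determination used above to score the predictors can be computed directly from its definition. The function name and the toy numbers are ours and purely illustrative; they are not results from the MLPF experiments.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - sum(e_i^2) / sum((y_i - mean(y))^2)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares, with e_i = y_i - f_i.
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    # Total sum of squares around the mean of the ground truth.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all ground-truth values are equal")
    return 1.0 - ss_res / ss_tot


# Toy final losses (made-up values) and three hypothetical sets of predictions.
truth = [1.0, 2.0, 3.0]
perfect = r_squared(truth, [1.0, 2.0, 3.0])      # predictions equal the truth
mean_only = r_squared(truth, [2.0, 2.0, 2.0])    # always predicting the mean
reversed_ = r_squared(truth, [3.0, 2.0, 1.0])    # worse than predicting the mean
```

A perfect predictor yields 1, always predicting the mean yields 0, and predictions worse than the mean yield negative values, which is why values around 0.9 indicate a predictor that explains most of the variance in the final loss.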
There are several strategies for integrating performance predictors with the HPO process [6]. A straightforward approach is to generate a certain number of random HP configurations, fully train M configurations, and partially train N > M configurations. The final loss of the fully trained configurations and part of their learning curves are used to train a performance predictor. This performance predictor is then used to predict the final loss of the partially trained configurations, and only those whose predicted loss is below a certain threshold are selected to complete training. This approach was tested using different types of performance predictors, including QSVRs, in [10].

Another, more sophisticated example of the use of performance predictors for HPO is the Fast-Hyperband algorithm [6]. It is a modified version of the well-known Hyperband algorithm that uses performance predictors, trained on the fly during the execution of the algorithm, to save training epochs of the target model with respect to Hyperband. More precisely, Fast-Hyperband adds an extra decision point based on performance prediction for every epoch in each Hyperband round. Decisions at these new intermediate points use a probability threshold, computed from an estimate of the standard deviation of every predictor used. The method proposed in Fast-Hyperband for computing this estimate is leave-one-out cross-validation. The main drawback of this approach is that it requires training many performance predictors, which makes it impractical to use QSVRs. This is partly due to runtime limitations on the quantum machine and partly due to the time needed to formulate the regression problem in a form suitable for the quantum annealer. In addition, the time spent connecting to the quantum machine needs to be taken into account. Furthermore, Fast-Hyperband, as defined in [6], is a sequential algorithm, which makes it unable to benefit from running in a distributed manner on multiple nodes in an HPC environment.

Figure 1 illustrates what happens inside each bracket of Swift-Hyperband. The vertical dashed lines represent the new decision points at which some configurations, or trials, are discarded based on their predicted performance. Trials that did not complete their round are drawn as dashed learning curves. That is, the dashed learning curves represent training epochs of the target model that Swift-Hyperband saved with respect to the classical Hyperband algorithm. Trials that train until the end of the round are always drawn as continuous lines. Which of these trials are promoted to the next round is decided exactly as in the classical Hyperband algorithm.
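The round structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the learning curve `toy_loss`, the least-squares line `fit_line` (a stand-in for the SVR or QSVR predictor), the single decision point per round, and the rule that a partially trained trial continues only if its predicted final loss ranks among the `n_promote` best are all hypothetical choices made for the example.

```python
def toy_loss(config, epoch):
    """Hypothetical learning curve: decays from 1.0 towards config['floor']."""
    return config["floor"] + (1.0 - config["floor"]) * 0.9 ** epoch


def fit_line(xs, ys):
    """Least-squares line y = a*x + b; a simple stand-in for an SVR predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def swift_round(configs, round_epochs, decision_epoch, n_full, n_promote):
    """One round with a single prediction-based decision point (assumed rule)."""
    full, partial = configs[:n_full], configs[n_full:]
    # Initial trainings: some trials run the whole round, the rest stop early.
    full_final = [toy_loss(c, round_epochs) for c in full]
    full_early = [toy_loss(c, decision_epoch) for c in full]
    part_early = [toy_loss(c, decision_epoch) for c in partial]
    epochs_used = n_full * round_epochs + len(partial) * decision_epoch

    # Train the predictor on the fully trained trials: early loss -> final loss.
    predict = fit_line(full_early, full_final)

    # Rank measured (full) and predicted (partial) final losses together.
    ranked = [(loss, i, True) for i, loss in enumerate(full_final)]
    ranked += [(predict(x), j, False) for j, x in enumerate(part_early)]
    ranked.sort()

    # Only partial trials predicted to rank among the best resume training.
    finished = list(zip(full_final, full))
    for _, j, is_full in ranked[:n_promote]:
        if not is_full:
            finished.append((toy_loss(partial[j], round_epochs), partial[j]))
            epochs_used += round_epochs - decision_epoch

    # Promotion at the end of the round, as in classical Hyperband.
    finished.sort(key=lambda pair: pair[0])
    return [c for _, c in finished[:n_promote]], epochs_used


floors = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
configs = [{"floor": f} for f in floors]
promoted, epochs_used = swift_round(
    configs, round_epochs=9, decision_epoch=3, n_full=3, n_promote=3
)
baseline_epochs = len(configs) * 9  # every trial trained to the end of the round
```

In this toy setting the round consumes 57 epochs instead of the 81 needed when every trial is trained to the end of the round, and it promotes the same three configurations. The toy curves make the stand-in predictor essentially exact; with real learning curves the predictions are noisy, so the savings and the promoted set depend on predictor quality.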

Swift-Hyperband
Swift-Hyperband not only needs to train fewer performance predictors than Fast-Hyperband but can also be easily parallelized, as all the initial full and partial trainings inside a round can be executed in parallel. For these two reasons, our algorithm can potentially use QSVRs and benefit from HPC environments. A schematic comparison between Fast-Hyperband and Swift-Hyperband is shown in Figure 2.
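Because the initial full and partial trainings of a round do not depend on each other, they can be dispatched to workers concurrently. The sketch below illustrates this with a thread pool from the Python standard library, used here purely as a stand-in for the MPI-based coordination of worker nodes used in the experiments; the `train` function, the toy learning curve, and the job layout are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def train(job):
    """Hypothetical training job: returns the loss of `config` after `epochs`."""
    config, epochs = job
    return config["floor"] + (1.0 - config["floor"]) * 0.9 ** epochs


def run_trainings(jobs, workers=2):
    """Run independent training jobs concurrently; results keep the job order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train, jobs))


configs = [{"floor": f} for f in (0.5, 0.1, 0.9, 0.3)]
# Two full trainings (9 epochs) and two partial trainings (3 epochs) of one round.
jobs = [(c, 9) for c in configs[:2]] + [(c, 3) for c in configs[2:]]
losses = run_trainings(jobs)
```

The predictor can only be trained once the full trainings have returned, so a round still contains a synchronization point; the sketch covers only the embarrassingly parallel part that precedes it.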

Results
To compare Hyperband, Fast-Hyperband, Swift-Hyperband, and Quantum-Swift-Hyperband (Swift-Hyperband using QSVRs) for different NN architectures, we simulate 10 runs of each algorithm using the datasets of learning curves derived from the following model-dataset combinations:
• MLPF [8] trained on the Delphes dataset [7].
• An image recognition Convolutional Neural Network (CNN) modified from [3] trained on the Cifar10 dataset.
• An image recognition CNN trained on the SVHN dataset used in [6].
• A natural language processing Long Short-Term Memory (LSTM) NN trained on the PTB dataset [11].
The results of these simulated runs can be seen in Figure 3, and a summary of the learning curve datasets used for the simulation is available in Table 1.
Beyond the simulated runs, we test the speedup provided by the parallelization of Swift-Hyperband, along with the achieved accuracies, by running Hyperband, Fast-Hyperband, Swift-Hyperband, and a parallel version of Swift-Hyperband that uses MPI to coordinate one CPU node and two GPU worker nodes. For these runs, the HPO target was a simple 6-layer CNN (different from the CNN used in the simulated runs) trained on Cifar10 using a 3-dimensional search space consisting of learning rate, weight decay, and dropout. This network was chosen because it was relatively fast to train. The results can be seen in Figure 4.
The results in Figure 3 and Figure 4 show that both Swift-Hyperband and its version using QSVRs achieve accuracies comparable to classical Hyperband while needing considerably fewer epochs in all cases. In comparison to Fast-Hyperband, Swift-Hyperband (with SVRs and with QSVRs) is faster in all cases except on the SVHN problem. In the non-simulated runs, all algorithms achieve accuracies around 87%, with both Swift-Hyperband and Parallel-Swift-Hyperband slightly beating Fast-Hyperband. Note that in several cases the version of Swift-Hyperband that uses QSVRs finds better performing configurations than the original Hyperband algorithm, something that would not happen if the QSVRs made perfect predictions. This may indicate that using performance predictors, aside from saving compute resources, has some type of regularizing effect that prevents some of the errors that Hyperband makes when terminating configurations at the end of each round.

Conclusions
We proposed a promising new parallelizable HPO algorithm, Swift-Hyperband, which integrates Hyperband with performance predictors and can be used in combination with SVRs or QSVRs. This leaves the door open for the use of Swift-Hyperband in later HPO cycles of MLPF. Furthermore, it was shown that, despite the current limitations of quantum computers, it is possible to execute hybrid Quantum/HPC workflows for HPO, achieving performance comparable to fully classical workflows. Further studies are needed on the speedup achieved by the parallelization of Swift-Hyperband when using a greater number of nodes, as Hyperband is known to suffer from straggler issues [3]. Hence, developing a version of the HPO algorithm ASHA that integrates performance predictors, potentially to be named Swift-ASHA, is one of the identified lines along which to continue this work. In addition, conducting more empirical tests of Swift-Hyperband on a wider range of target models would provide valuable insights into the behavior of the algorithm. Finally, theoretical studies on SVRs and QSVRs could provide a deeper understanding of why Swift-Hyperband occasionally outperforms the original Hyperband algorithm and shed light on the differences observed when employing quantum or classical SVRs.

Figure 1: Graphic representation of a Swift-Hyperband bracket. This is an illustrative example, not the result of an actual execution of the algorithm.

Figure 3: Average, best, and worst performance of the best configuration found as well as the total number of epochs consumed by the different HPO algorithms for different NNs and datasets.

Figure 4: Performance of the best configuration found and total runtime needed for different HPO algorithms.

Table 1: Summary of learning curve datasets.