Event Classification with Multi-step Machine Learning

The usefulness and value of Multi-step Machine Learning (ML), in which a task is organized into connected sub-tasks with known intermediate inference goals, as opposed to a single large model learned end-to-end without intermediate sub-tasks, is presented. Pre-optimized ML models are connected, and better performance is obtained by re-optimizing the connected system. The selection of an ML model from several small candidate models for each sub-task is performed using ideas based on Neural Architecture Search (NAS). In this paper, Differentiable Architecture Search (DARTS) and Single Path One-Shot NAS (SPOS-NAS) are tested, where the construction of the loss functions is improved to keep all ML models learning smoothly. Using DARTS and SPOS-NAS for the optimization, selection, and connection of multi-step machine learning systems, we find that (1) such a system can quickly and successfully select highly performant model combinations, and (2) the selected models are consistent with baseline algorithms, such as grid search, and their outputs are well controlled.


Introduction
Machine learning (ML), and in particular deep learning (DL), has evolved rapidly thanks to the availability of huge computing power and big data, and has proven successful in many applications such as image classification and natural language translation. In most ML approaches, a single task with a large model learned end-to-end is defined and trained to solve a given problem (see Fig. 1(a)). In most cases, this end-to-end approach provides state-of-the-art precision and accuracy for a given problem. However, we adopt a different approach, which can still give acceptable precision and accuracy: we connect several ML models, each of which solves a part of the given problem, as shown in Fig. 1(b). We call this Multi-step ML. Moreover, in some Multi-step ML settings, we can assume that there are several different candidate ML models that solve the same sub-task, as shown in Fig. 1(c). In this paper, ideas for connecting sub-tasks and selecting their models are presented.

Multi-step ML
For a given task, we break it into several sub-tasks with known intermediate inference goals and find an optimal ML model for each sub-task, resulting in the best model chain. From a different perspective, assuming that there are several tasks with well-defined or well-trained ML models, we solve a new task by combining them. The common feature of both views is that a given task consists of multiple sub-tasks, where an ML model for each sub-task is relatively easy to build, or a well-defined or well-trained ML model already exists.
Our approach may trade some accuracy for interpretability when compared with end-to-end paradigms. Its merits include: • Domain knowledge is easily introduced into the ML models for sub-tasks, and such models can be reused in other problems that involve common sub-tasks, • Intermediate data, i.e. the outputs of sub-tasks, provide information for understanding the behavior of the ML models, which can lead to better explainability of ML.
The simplest way to connect ML models is simply to feed the output of one ML model into the input of the next. In this paper, we introduce more effective methods for connecting ML models. What type of problem is suitable for Multi-step ML?
Sub-tasks have to be defined such that a given problem can be expressed as a set of sub-problems, where the sub-problems, together with their roles, inputs, and outputs, are recognized and defined by a human. Stated this broadly, this could correspond to problems found in almost any community. A simple example comes from image classification: a sub-task to identify objects in an image and a sub-task to understand the context of the image using the identified objects can be separated [1,2]. From the viewpoint of data flow, input data for Multi-step ML are produced via several steps; in other words, there is a hierarchical structure in the input data. For this type of data, we can solve the problem by defining sub-tasks for each step. In practice, problems using data produced by simulating a model or theory are well suited, since such data tend to have a hierarchical structure and sub-tasks can be defined easily using supervised information. This matches studies in the sciences, for example experimental particle physics, as shown later.
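The simplest connection mentioned earlier, feeding one model's output into the next, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `model_task1` and `model_task2` are hypothetical stand-ins for trained sub-task models.

```python
import numpy as np

def model_task1(x):
    # hypothetical Task 1 model: a stand-in for a learned calibration
    return 2.0 * np.asarray(x, dtype=float)

def model_task2(y):
    # hypothetical Task 2 model: a stand-in sigmoid classifier acting
    # on the calibrated quantities produced by Task 1
    return 1.0 / (1.0 + np.exp(-y))

def multistep(x):
    # simplest Multi-step ML connection: output of Task 1 -> input of Task 2
    return model_task2(model_task1(x))
```

The point of the later sections is that this naive composition can be improved by jointly re-optimizing the chained models.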

NAS to select MLs
The situation shown in Fig. 1(c) might happen when there are different models requiring the optimization of ML model structures and hyperparameters. We also expect that the choice of ML in the intermediate steps is not unique, that is, it will depend on the goal of a given task. To select one of the models for each sub-task, in this paper, we use the idea of neural architecture search (NAS).
We demonstrate the usefulness and value of Multi-step ML on (1) the re-optimization of model weights after sequentially connecting multiple ML models, and (2) the selection of an ML model from multiple ML model candidates. For the latter, we adopt the idea of the differentiable architecture search (DARTS [3]) and the single path one-shot neural architecture search (SPOS-NAS [4]) to a task of particle physics, where two sub-tasks are defined to solve a given task. This is called Model Selection with NAS (MSNAS).

Particle Physics
Experimental particle physics aims to understand the fundamental laws of nature and to reveal unknown ones using huge amounts of data. In collider physics experiments, each event of data is produced from collisions in a high-energy accelerator. An event contains several particles, which are measured with detectors surrounding the collision points. The classification of events is quite important in collider physics data analysis, where interesting signal events are separated from background events. ML has been used in collider physics research, for example boosted decision trees (BDT) for event classification [5], and DL for event classification [6], jet imaging [7], etc.

Related Work
Multi-step ML can be categorized under so-called Automated Machine Learning (AutoML); see, for example, Ref. [8], where hyperparameter optimization, meta-learning, NAS, etc. are described and discussed. The scope of AutoML is huge and continues to grow. One difference from AutoML is that in this paper we focus on the connection of multiple ML models, where not only the task but also the sub-tasks are defined by humans, because each sub-task has its own purpose.
NAS was introduced to automatically design a network architecture with the best performance for a given task and with less human intervention. The purpose of Multi-step ML is different from that of NAS; however, methods developed for NAS are applicable to Multi-step ML. Several surveys of NAS are available, for example Refs. [9,10], covering image classification [11][12][13], object detection [12,14], etc. Some NAS algorithms require large computational resources, for example due to the discrete search space of architectures combined with reinforcement learning. To overcome this issue, the idea of one-shot NAS [15,16] is promising: ENAS [17], DARTS [3,18,19], ProxylessNAS [20], SPOS-NAS [4], SNAS [21], and so on. In DARTS, differentiable calculation over the search space is introduced using the softmax function. In SPOS-NAS, supernet training and architecture search are decoupled by using the single path one-shot approach. In NAS, a neural architecture is selected to achieve the best performance for a given task. In Multi-step ML, on the other hand, an ML model is selected by considering the performance of both the given task and its sub-tasks.

Methods
In this paper, the ideas of DARTS [3] and SPOS-NAS [4] are used to select one of the ML models. One motivation for these NAS-based algorithms is to reduce computational complexity. By optimizing all model combinations simultaneously, instead of optimizing each combination separately, the compute time is reduced drastically because repetitive model training is avoided. In the method based on DARTS, a network consists of ML models connected in parallel, and all model weights in the network are optimized simultaneously for each mini-batch. In the method based on SPOS-NAS, a network consists of randomly sampled ML models, and the model weights in the randomly selected combination are optimized for each mini-batch.
We briefly summarize these algorithms below and explain what we have added for connecting and selecting models.

Based on DARTS
DARTS represents a search architecture as a graph, where an edge corresponds to an operation. Each edge has an architecture weight (α), which is used to aggregate the outputs o_i on the same path with a softmax function: o = Σ_{i∈path} softmax(α)_i · o_i. After the training of the architecture weights, the operations that have the maximum α on each path are selected as the final operation set.
We use this idea for model selection. The operations, represented as edges, are replaced with candidate models for a sub-task. The outputs of the models are aggregated using the architecture weights α, called model architecture weights hereafter: y_t = Σ_{i∈models} softmax(α)_i · y_{t,i}, where y_{t,i} is the output of the i-th model for the t-th sub-task. After the training of the model architecture weights, the model with the maximum α in each sub-task is selected as part of the final model set.
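As a concrete sketch, the softmax-weighted aggregation and the final argmax selection might look like the following. This is illustrative numpy code, not the authors' implementation:

```python
import numpy as np

def softmax(alpha):
    e = np.exp(alpha - np.max(alpha))   # shift for numerical stability
    return e / e.sum()

def aggregate(alpha, outputs):
    # y_t = sum_i softmax(alpha)_i * y_{t,i}: the DARTS-style mixture of
    # candidate-model outputs for one sub-task
    w = softmax(np.asarray(alpha, dtype=float))
    return sum(w_i * y_i for w_i, y_i in zip(w, outputs))

def select(alpha):
    # after training, keep the candidate with the largest architecture weight
    return int(np.argmax(alpha))
```

With equal architecture weights the aggregation reduces to a plain average; as training sharpens the weights, the mixture concentrates on the model that will survive the argmax selection.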
First, pre-training is applied: every single model is individually trained using ground-truth data. Second, the model architecture weights are optimized following the DARTS optimization, where the model weights (w) and the model architecture weights (α) are optimized separately: the model architecture weights are updated to be optimal for the validation data, while the model weights are fitted to the training data. Finally, with the models fixed by the model architecture weights, the weights of the selected models are re-optimized on the training data. The algorithm is outlined in Appendix A.
The loss function used in DARTS for MSNAS is built from the loss functions of the sub-tasks: L = Σ_{t∈tasks} v_t [ L_t(y_t^true, y_t^pred) + λ Σ_{i∈models} L_t(y_t^true, y_{t,i}^pred) ], where L_t, y_t^true, and y_t^pred are the loss function, the ground truth, and the aggregated model prediction for the t-th task, respectively. The first term is the loss for the aggregated outputs over the models; this term is necessary for differentiably updating the model architecture weights. The second term is the loss for each individual model. Without the second term, the outputs for each sub-task do not converge to the targets y^true, because there are degrees of freedom that can significantly change the output values of individual models while leaving the loss value unchanged, due to interference between models connected in parallel. In this study, λ is fixed to 1, and the task weights (v_t) are hyperparameters with the normalization Σ_{t∈tasks} v_t = 1, which was found to be a reasonable choice satisfying both the validity of each model output and the performance of the whole task.
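A minimal numpy sketch of this per-task loss construction follows. The `mse` stand-in and the toy shapes are illustrative, not the paper's exact implementation:

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def darts_msnas_loss(v, y_true, y_agg, y_models, lam=1.0, loss_fn=mse):
    # total = sum_t v_t * [ loss(y_true_t, y_agg_t)                   (aggregated term)
    #                       + lam * sum_i loss(y_true_t, y_{t,i}) ]   (per-model term)
    total = 0.0
    for t, v_t in enumerate(v):
        total += v_t * (loss_fn(y_true[t], y_agg[t])
                        + lam * sum(loss_fn(y_true[t], y_i) for y_i in y_models[t]))
    return total
```

The per-model term keeps every candidate anchored to the target even while the aggregated term drives the architecture weights.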
We use the Adam optimizer with a learning rate of 10^−3 for pre-training, model architecture weight determination, and post-training. The training is terminated after 100 epochs or if the validation loss does not decrease for 10 (20) epochs in pre/post-training (model architecture weight determination).

Based on SPOS-NAS
In Ref. [4], in the context of NAS, supernet optimization (weight optimization) and architecture search are decoupled. Weights are optimized after selecting a single path of architectures with uniform path sampling. Then, the architecture search is performed using an evolutionary algorithm.
We use this idea for model selection. The model weights in each sub-task are optimized after selecting a single path of models with uniform sampling. Then, the model search, where the best model is selected for each task, is performed using a grid search instead of the evolutionary algorithm, since the number of models in this study is small. The algorithm is outlined in Appendix A.
The loss function used in SPOS-NAS for MSNAS is built from the loss functions of the sampled models: L = Σ_{t∈tasks} v_t L_t(y_t^true, y_{t,i_t}^pred), where the model i_t for the t-th sub-task is randomly sampled. In this study, the task weights (v_t) are hyperparameters, as in DARTS for MSNAS, with the normalization Σ_{t∈tasks} v_t = 1. The optimizer and the training termination strategy are the same as in the DARTS method.
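The two phases can be sketched as below: uniform single-path sampling during supernet training, then an exhaustive grid search over the small model space. Here `score` is a hypothetical evaluation function returning, e.g., a validation metric for a model combination:

```python
import itertools
import random

def sample_path(candidates_per_task, rng):
    # supernet training phase: pick one candidate model per sub-task,
    # uniformly at random, and train only the weights on that path
    return tuple(rng.choice(models) for models in candidates_per_task)

def grid_search(candidates_per_task, score):
    # search phase: with few candidates, evaluate every combination
    # (the original SPOS-NAS paper uses an evolutionary algorithm instead)
    return max(itertools.product(*candidates_per_task), key=score)
```

Decoupling the phases means the expensive weight training happens once, and the search afterwards only needs cheap evaluations.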

Experiments and Results
A toy problem from experimental particle physics is prepared to prove the concept of Multi-step ML. All datasets have been generated by Monte Carlo simulation.

Problem and task
The problem used as an experiment in this paper is the classification of particle origin: one is a Higgs boson (H) and the other is a Z boson. The main differences between the two particles are their mass (125 GeV for H, 91 GeV for Z) and spin (0 for H, 1 for Z). Both particles promptly decay into a pair of τ-leptons (H/Z → τ±τ∓) with some probability, and the τ-leptons then decay into various particles, leaving energy deposits in the detectors. From the signature left in the detector, τ-lepton candidates are reconstructed, and the particle origin is then identified. In this paper, we separate this problem into two parts and define sub-tasks: the first task (Task 1) is the energy calibration (measurement) of the τ-lepton candidates, and the second (Task 2) is the classification of H/Z using a pair of τ-lepton candidates, where the input is the output of Task 1. The loss function of Task 1 is defined as the mean squared error of the τ-lepton momenta, L_1 = (p_1^pred − p_1^true)² + (p_2^pred − p_2^true)², where p_1 (p_2) is the leading (sub-leading) τ-lepton momentum. For Task 2, a binary cross-entropy loss is used. For stable training, the output of Task 2 is defined as logits instead of probabilities (i.e. sigmoid(logits)), and the cross-entropy is computed on the logits. The output aggregation in the DARTS method is therefore also defined on the logits, y = Σ_{i∈models} softmax(α)_i · y_i. In our problem setting, the two loss functions cannot be treated as having equivalent statistics: the loss of Task 1 is a χ² assuming a momentum resolution of 1 GeV, while the loss of Task 2 can be regarded as a negative log-likelihood based on the Bernoulli distribution. To match the scales of the two loss functions, the loss of Task 1 is scaled by 10^−4, i.e. the momentum is normalized by 100 GeV, in our experiments.
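The two losses and their relative scaling can be sketched as follows. These are assumed forms consistent with the description above; the numerically stable BCE-on-logits identity is a standard trick, not taken from the paper:

```python
import numpy as np

def task1_loss(p_pred, p_true, scale=1e-4):
    # momentum MSE, scaled by 1e-4 (i.e. momenta normalized by 100 GeV)
    p_pred, p_true = np.asarray(p_pred, float), np.asarray(p_true, float)
    return scale * float(np.mean((p_pred - p_true) ** 2))

def task2_loss(logits, labels):
    # binary cross-entropy evaluated on logits for numerical stability:
    # BCE = max(z, 0) - z*y + log(1 + exp(-|z|))
    z, y = np.asarray(logits, float), np.asarray(labels, float)
    return float(np.mean(np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))))
```

With these forms, a 100 GeV momentum error and a maximally uncertain classifier output (logit 0) contribute losses of the same order, which is the point of the 10^−4 scaling.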

Dataset
The data were produced with particle physics simulation programs. We use only hadronic τ-leptons, which decay into hadrons rather than electrons or muons. Task 1 uses reconstructed-level jet 4-vectors and calorimeter/tracker information as input variables, and predicts the truth-level τ-lepton momentum. The calorimeter/tracker information is given as 16×16 pixel images, which are expected to be used to estimate the momentum of the neutrinos from the τ-lepton decays in order to calibrate the τ-lepton momentum. The input/output formats of Task 1 and Task 2 and example pixel images are found in Appendix B. The transverse momentum p_T used in the ML models of Task 1 and Task 2 is normalized by p_T ← log(0.1 + p_T (GeV)) to fit the values into a reasonable range for machine learning algorithms. We have 50,000 events each for H and Z, with 60% used for training, 20% for validation, and 20% for testing.
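The p_T normalization and the 60/20/20 split can be sketched with two small helpers. These are illustrative; the paper does not show its preprocessing code:

```python
import numpy as np

def normalize_pt(pt_gev):
    # p_T <- log(0.1 + p_T[GeV]): maps momenta into a range that is
    # comfortable for the ML models
    return np.log(0.1 + np.asarray(pt_gev, dtype=float))

def split_train_valid_test(n_events, rng):
    # 60% training / 20% validation / 20% test, as used for the
    # 50,000 events per class
    idx = rng.permutation(n_events)
    n_tr, n_va = int(0.6 * n_events), int(0.2 * n_events)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
```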

Models
Three kinds of models are prepared for each task. For Task 1, a Multi-Layer Perceptron (MLP), a CNN, and a linear transformation method, called the scale factor method (Sf) hereafter, are defined. For Task 2, an MLP, a Long Short-Term Memory (LSTM) network, and a simple mass method (Mass) are defined. The MLP, CNN, and LSTM models are typical deep learning models, while the Sf and Mass models are based on conventional methods used in collider particle physics. They are robust compared to deep learning models but, because of their simplicity, are expected not to be the best models.
The MLP and CNN models for Task 1 consist of two blocks: image feature extraction and correction factor evaluation (see Appendix C). The second block is designed to output a momentum residual, as in ResNet [24]. The CNN has a good domain bias for image recognition, while the MLP does not; the MLP model for Task 1 is expected to overfit due to its large number of trainable parameters for this problem. The Sf model for Task 1 applies a linear transformation (f(x) = ax + b) to each variable (p_T, η, φ).
The MLP model for Task 2 is a simple deep neural network with three hidden layers of 32 nodes. The LSTM model for Task 2 is built from three stacked LSTM modules with 32 hidden nodes; the two τ-leptons, ordered by jet p_T, are fed sequentially to the LSTM module. The Mass model for Task 2 calculates the system mass of the two τ-lepton candidates and feeds it to an MLP with two hidden layers of 64 nodes.
In all models above, ReLU is used as the activation function. The model hyperparameters, e.g. the number of layers, are determined by scanning them for each single model.
In addition, two dummy models, Zeros (the output is always 0) and Noise (Gaussian noise ∼ N(µ = 0, σ² = 1)), are prepared and used in the model selection studies. We expect that if DARTS works well, these models should not be selected. SPOS-NAS, on the other hand, cannot include the dummy models, since the weights of the real models (MLP, etc.) are strongly affected if a dummy model appears in a single path; the SPOS-NAS experiments are therefore performed without them.
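The two negative-control models are trivial to write down (illustrative versions):

```python
import numpy as np

def zeros_model(x):
    # dummy model: always predicts 0, regardless of the input
    return np.zeros(np.shape(x)[0])

def noise_model(x, rng):
    # dummy model: pure Gaussian noise ~ N(0, 1), carrying no information
    return rng.normal(0.0, 1.0, size=np.shape(x)[0])
```

A selection method that works should assign these candidates negligible architecture weight.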
Before performing any studies, each model is pre-trained using ground-truth data from the simulation as explained in Section 3.

Re-optimization of ML models used in multiple steps
We present the usefulness of re-optimizing model weights after sequentially connecting multiple trained ML models. We perform experiments with two strategies: Without re-optimization : train a Task 1 model, then train a Task 2 model using the outputs of the Task 1 model.
With re-optimization : train Task 1 and Task 2 models separately using ground-truth data from the simulation, then build the connected model and train it.
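The two strategies can be sketched abstractly as follows, with hypothetical `fit()`/`finetune()` helpers standing in for the actual gradient-based training loops:

```python
def without_reoptimization(model1, model2, x, y1_true, y2_true, fit):
    # train Task 1 on ground truth, then train Task 2 on Task 1's outputs
    fit(model1, x, y1_true)
    fit(model2, model1(x), y2_true)
    return model1, model2

def with_reoptimization(model1, model2, x, y1_true, y2_true, fit, finetune):
    # pre-train each model separately on ground truth, then connect them
    # and jointly re-optimize the whole chain on the final target
    fit(model1, x, y1_true)
    fit(model2, y1_true, y2_true)
    finetune(model1, model2, x, y2_true)
    return model1, model2
```

The difference is only the final joint fine-tuning step, which lets Task 2 adapt to the imperfections of Task 1's outputs.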
The performance (AUC) of Task 2 is measured for all model combinations (excluding dummy models) with and without re-optimization, as shown in Fig. 2, where each experiment is executed 20 times with different random seeds on the same dataset. For the re-optimization model, v_1 is set to zero. Re-optimization after pre-training improves the performance of the final task for every model pair. The pairs (CNN, MLP) and (CNN, LSTM) have the highest AUC values in this experiment and should therefore be selected in MSNAS.

Explainability
By splitting a large problem into sub-tasks, Multi-step ML provides access to intermediate states in readable form. To evaluate the interpretability of the intermediate states, we measure the fraction of outliers in the Task 1 output. As a reference model that predicts robust and controlled outputs, we use a Gaussian process (GP), a Bayesian machine learning technique that predicts expected values with their uncertainties. The Gaussian process is implemented using the GPyTorch package and is trained on the same training data. The validity of the Task 1 output is defined as the fraction of events in which a model prediction for Task 1 lies within the two-sigma range predicted by the Gaussian process. The validity of Task 1 is shown in Fig. 3(c) as a function of the Task 1 weight (v_1). Strong constraints on the Task 1 loss result in predictions similar to the Gaussian process model. Weak constraints, on the other hand, give predictions that differ from the Gaussian process model beyond the uncertainties; such predictions cannot be interpreted as the expected quantity, e.g. particle momentum in this case. A proper setting of the task weights is required for the model to be explainable. Figure 5 shows which models are selected in this experiment as a function of the Task 1 weight v_1. Grid search and SPOS-NAS select the (CNN, LSTM) and (CNN, MLP) pairs in this order, while DARTS selects the same two model pairs but with different fractions. Considering that the difference in AUC between these pairs is small, as shown in Fig. 2, DARTS has practical model selection power, but may have weaker model selection ability than the grid search and SPOS-NAS methods.
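The validity metric reduces to a simple coverage fraction. In this sketch, `gp_mean` and `gp_std` stand in for the GPyTorch model's per-event predictive mean and standard deviation:

```python
import numpy as np

def validity_fraction(pred, gp_mean, gp_std):
    # fraction of events whose Task 1 prediction lies within the
    # reference Gaussian process's two-sigma band
    pred, mu, sigma = (np.asarray(a, dtype=float) for a in (pred, gp_mean, gp_std))
    return float(np.mean(np.abs(pred - mu) <= 2.0 * sigma))
```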

Selection of an ML model from multiple ML models in each step
The performance of Task 1 (MSE) and Task 2 (AUC) is shown in Figs. 6(a) and (b) as a function of the Task 1 weight v_1. As expected, as the Task 1 weight v_1 increases, the performance of Task 1 improves while that of Task 2 (AUC) worsens. There is, however, a moderate region where the performance of both tasks does not change much from the best point. MSNAS gives a nearly optimal prediction for the final task while keeping the intermediate data under control. A validity check using the Gaussian process prediction is performed as shown in Fig. 6(c). Grid search and SPOS-NAS have similar performance, while DARTS gives quite different predictions from the Gaussian process at small v_1, which will be investigated further.

Scalability
The computational complexity of a grid search scales as O(∏_{t∈tasks} N_{model,t}), i.e. O(N_models²) in our experiment, from the number of model-pair combinations. The complexity of DARTS and SPOS-NAS, on the other hand, is O(N_tasks · N_models), because the models are trained simultaneously.
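The scaling difference is just a product-versus-sum count of full trainings, as this back-of-the-envelope sketch shows:

```python
from math import prod

def grid_search_trainings(n_models_per_task):
    # grid search trains every model combination separately
    return prod(n_models_per_task)

def one_shot_trainings(n_models_per_task):
    # DARTS / SPOS-NAS train all candidates jointly, roughly once each
    return sum(n_models_per_task)
```

With three candidates per task in a two-task chain, that is 9 separate trainings for grid search versus about 6 jointly trained models for the one-shot methods, and the gap widens quickly with more candidates or more tasks.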
We check the computational cost as a function of the number of models. In this experiment, the same models as explained in Section 4.3, with the dummy models excluded, are used, i.e. three models per task. To measure the scalability with a larger number of trainable models, these models are replicated with different initializations, where the weights of the replicated models are not shared. The wall time for grid search, DARTS, and SPOS-NAS is shown in Fig. 7(a) as a function of the number of models per task, where two datasets of different sizes are used to keep the measurements within a reasonable execution time.

The dependency on the number of models follows our expectation: O(N_models) for DARTS and SPOS-NAS and O(N_models²) for grid search. SPOS-NAS uses a grid search instead of an evolutionary algorithm after the one-shot NAS, which changes its scaling from O(N_models) to O(N_models²); this would be relaxed if an evolutionary algorithm were used. The performance of the Task 2 model prediction is stable against the number of models, as shown in Fig. 7(b). The DARTS and SPOS-NAS methods therefore scale well for multi-step machine learning problems.

Discussion
We have performed MSNAS based on NAS techniques (DARTS and SPOS-NAS) to connect and select ML models. From the perspective of performance, an ensemble of several ML models or of different weight initializations, where all trained models are used for inference, may improve the final prediction. The purpose of Multi-step ML, however, is different, as introduced in Section 1: sub-tasks are defined for a given large problem, which can make the ML models simpler. The number of model parameters can be reduced by selecting one well-performing model, resulting in faster inference with lower computing resource requirements, e.g. memory. Splitting a problem into well-defined sub-tasks instead of building a single large task increases explainability, since all the outputs of the sub-tasks can be accessed. Moreover, the selected model may reveal a proper domain bias for the problem: e.g. if an LSTM model is selected, the problem likely has a strong sequential structure.
Hyperparameter optimization is also in high demand in the machine learning community. MSNAS can select optimal hyperparameters as well as optimal models by registering multiple models with different hyperparameters as the candidates for a task. This method is scalable if each model has a small number of hyperparameter combinations, i.e. O(N_task · N_hp_comb) candidates when each model has N_hp_comb hyperparameter combinations. However, it is not scalable if models have many types of hyperparameters, i.e. O(N_task · N_hp_grid^N_hp) when each model has N_hp types of hyperparameters, each taking N_hp_grid values. In practice, applying the method to hyperparameter scans requires it to work well for discrete parameters; this is a future challenge.
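The two regimes correspond to a simple candidate count. This is a sketch of the scaling argument, with hypothetical parameter names:

```python
def candidates_per_combination(n_tasks, n_hp_comb):
    # scalable case: each task holds n_hp_comb pre-built model variants
    return n_tasks * n_hp_comb

def candidates_full_grid(n_tasks, n_hp_types, n_hp_grid):
    # non-scalable case: a full grid over n_hp_types hyperparameters,
    # each taking n_hp_grid values, grows exponentially in n_hp_types
    return n_tasks * n_hp_grid ** n_hp_types
```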
The task weights (v_t) are treated as hyperparameters in this study; they determine how much we respect the intermediate outputs. Alternatively, we could treat them as floating parameters, as in studies of multi-task learning [25][26][27][28][29][30]. This is also a future challenge.
For model selection, SPOS-NAS achieves nearly the same performance as grid search, but DARTS does not; this is under investigation. Moreover, our experiment is based on one specific problem, which is not widely known in the computer science field, so more problems are needed to test whether our results generalize.
For collider particle physics, the toy model used in this paper is a simplified model to demonstrate MSNAS methods. In a more realistic case, there are more tasks to be considered, i.e. a tau identification task, or there is room to extend the object types and topology, e.g. b-jets, photon or leptons. We plan to integrate such models and objects for MSNAS to be more practical in the future.

Conclusion
The usefulness and value of Multi-step ML have been presented in this paper. Re-optimization after connecting multiple ML models gives better performance than the case without re-optimization. The selection of a single ML model has been performed using the ideas of DARTS and SPOS-NAS, where the construction of the loss function is improved to keep all ML models learning smoothly. Using DARTS and SPOS-NAS for the optimization, selection, and connection of multi-step machine learning systems, we find that (1) such a system can quickly and successfully select highly performant model combinations, and (2) the selected models are consistent with baseline algorithms such as grid search, and their outputs are well controlled. Our idea has been tested on one specific problem, so more problems are needed to test whether our results generalize.

B Details of Dataset
The input/output formats of Task 1 and Task 2 are summarized in Tables 1 and 2, respectively. The calorimeter/tracker information is given as 16×16 pixel images, as shown in Fig. 8.

Figure 9 (Appendix C): model architectures of (a) the MLP model and (b) the CNN model used in Task 1. Both models consist of two blocks: image feature extraction and correction factor evaluation.

The energy E and 3-momentum p of a particle form a Lorentz vector (E, p_x, p_y, p_z), which can be converted into p_T = √(p_x² + p_y²), η = −0.5 ln[(1 − cos θ)/(1 + cos θ)] with cos θ = p_z/|p|, φ = atan2(p_y, p_x), and m = √(E² − p_x² − p_y² − p_z²).