The use of Boosted Decision Trees for Energy Reconstruction in JUNO experiment

The Jiangmen Underground Neutrino Observatory (JUNO) is a neutrino experiment with a broad physical program. The main goals of JUNO are the determination of the neutrino mass ordering and high precision investigation of neutrino oscillation properties. The precise reconstruction of the event energy is crucial for the success of the experiment. JUNO is equiped with 17612 + 25600 PMT channels of two kind which provide both charge and hit time information. In this work we present a fast Boosted Decision Trees model using small set of aggregated features. The model predicts event energy deposition. We describe the motivation and the details of our feature engineering and feature selection procedures. We demonstrate that the proposed aggregated approach can achieve a reconstruction quality that is competitive with the quality of much more complex models like Convolution Neural Networks (ResNet, VGG and GNN).


Introduction
JUNO is a neutrino observatory under construction in southern China. Its physical program covers a wide range of problems [1]. The main goals are to determine the neutrino mass ordering and to accurately measure the parameters of neutrino oscillations sin 2 θ 12 , ∆m 2 21 , ∆m 2 31 . JUNO will detect reactor neutrinos from the Yangjiang and Taishan nuclear power plants. Simultaneously JUNO will be able to observe neutrinos from supernovae, atmospheric neutrinos, solar neutrinos and geoneutrinos. Figure 1 shows the detector design. The detector is a transparent acrylic sphere with a diameter of 35.4 meters that is located underground in a cylindrical water pool. The sphere is filled with 20 kt of liquid scintillator. The detector is equipped with a huge number of photomultiplier tubes (PMTs) of two types: 17 612 large PMTs (20 inches) and 25 600 small PMTs (3 inches). Neutrinos, which are produced in nuclear reactors, interact with the protons of the scintillator in the detector via the inverse beta-decay (IBD) channel: ν e + p → e + + n. The scintillator then produces visible light upon the interaction of the ejected positron with the media. The amount of emitted photons is tightly related to the neutrino energy. The neutron, after some time, is captured by a hydrogen atom of liquid scintillator, producing 2.2 MeV de-excitation gammas. Thus, the time coincidence of signals from the positron and the neutron makes it possible to separate the event from backgrounds. The information collected by PMTs is used for estimation of the neutrino energy.
To resolve the neutrino mass ordering the energy resolution must be σ 3% at 1 MeV, which is very close to the statistical limit corresponding to the light yield in JUNO, about 1300 detected photons (hits) at 1 MeV. The energy nonlinearity uncertainty should be < 1% [1].
Machine Learning (ML) methods are very popular in science today, including high energy physics, in particular, neutrino experiments [2] and collider experiments [3]. We use ML approach for energy reconstruction in the JUNO experiment. Our problem is a regression supervised learning problem. The data (time and charge information) collected by PMTs is used as input for supervised training of ML model. Earlier we demonstrated that the ML approach can have the quality required for the JUNO experiment on our data and also has the advantage of speed and ease of application [4].
In this work we use Boosted Decision Trees (BDT) [5] for energy reconstruction in the energy range of 0-10 MeV covering the region of interest for IBD events from reactor electron antineutrinos. Compared to [4] we designed and studied new features and achieved much better resolution with BDT, which is now comparable to the resolution of more complex models.

Data description
The dataset is generated by the full detector Monte Carlo method using the official JUNO software [6]. The detector simulation is based on the Geant4 framework [7] with the geometry implemented in detail. The train and test datasets are described as follows: 1. Training dataset consists of 5 million events, uniformly distributed in kinetic energy from 0 to 10 MeV and in the volume of the central detector (in liquid scintillator). Our data have four configurations: 1) without electronics effects; 2) taking into account the transit time spread (TTS) of PMTs; 3) taking into account the dark noise (DN) of PMTs; and 4) taking into account both effects. TTS occurs due to the stochasticity of the path of photoelectron from the photo-cathode to the anode and effectively smears the time information. DN effect gives spontaneous hits on PMTs. In further TTS and DN are always enabled if not specified otherwise. Figure 2 illustrates an example of accumulated charge in the PMT channels (left) and the evolution in time of the same signal in terms of the first hit time distribution (right).

Boosted Decision Trees
BDT is the an ensemble model, where a simple and quickly learning Decision Tree (DT) model is used as the base algorithm. DTs in BDT are trained sequentially. Each subsequent DT is trained to correct errors of previous DTs in the ensemble. In this work we use the XGBRegressor implementation of BDT from the XGBoost library [8].
DT is built recursively starting from the root node, splitting the source set into two subsets (left and right) based on the values of the input features. To build a tree, we need a principle based on which we will split the original set of objects into subsets. XGBRegressor uses Gain maximization to splitting input data into subsets.
In XGBoost the objective function contains a two parts: the training loss and the regularization term: A tree is penalized if the sum of the norm of values in its leaves is very large. Therefore, the regularization term is introduced here as follows: where T is the number of leaves, ω j are values in the leaves, γ and λ are numerical parameters of the regularization. In [8] the authors showed that the optimization of an objective function (1) reduces to maximizing of the Gain. And Gain is defined as: where G, H are the corresponding sums of the first and second derivatives of the objective function for a given partition and the indices l and r mean the left and right partition.

Feature Engineering
The basic features for the energy reconstruction are the following aggregated features: 1. Total number of detected photo-electrons (hits): nHits.
In the first approximation, the total number of hits is proportional to the event energy.
2. Coordinate components of the center of charge: and its radial component: Coordinate components of the center of charge are rough approximations of the location of the energy deposition. These features are important for energy reconstruction since the number of hits depends on the location of the energy deposition.
3. Coordinate components of the center of first hit time: and its radial component: Here the constant c is required to avoid division by zero. These features bring extra information on the location of the energy deposition.
4. Mean and standard deviation of the first hit time distributions: ht_mean, ht_std.
For ML models including Boosted Decision Trees, it is often useful to engineer new features from the existing features [9]. We use the following extra synthetic features: J cc = R 2 cc · sin θ cc , ρ cc = x 2 cc + y 2 cc .
And some trigonometric functions of angles θ cc , φ cc : sin θ cc , cos θ cc , sin φ cc , cos φ cc . We also use 11 similar features for the center of first hit time.
In addition, we prepare five more features related to the location of the PMT received the maximum number of photo-electrons: x_max, y_max, z_max, theta_max, phi_max, the maximum number of photons on PMT npe_max, and the average number of photons on PMTs npe_mean.
Also we added the following features: entries1, entries2. Here, entries1 is the percentage of PMTs with only 1 hit, entries2 -with 2 hits. And the one more feature is nPMTs -the total number of fired PMTs. Now let's take a closer look at the first hit time distribution. Consider what fraction of fired PMTs received at least one photon depending on time, which is, in essence, the cumulative distribution function (CDF) of the first hit time distribution. Figure 3a illustrates an example of a 7 MeV event. The entire event typically lasts for about 1000 ns, but the majority of photo-electrons are recorded by the PMTs in the first hundred nanoseconds and then mainly dark hits are recorded.
In Figure 3a one also can see that at the beginning of events there is a short period of time ∆t during which the photons have not reached the PMTs and only dark noise is recorded. Figure 3b shows how the CDFs for the events with the energy of 7 MeV change depending on their location, closer to the edge (large R) or closer to the center of the detector (small R). In the case where R is quite small, it takes more time for the photons to be detected by PMTs and also it takes more time to "saturate".   Finally, the idea is to simply decompose the entire curve into a set of percentiles, and then select those that suit best for energy reconstruction. X%-percentile indicates how long it takes to register X% of the first PMT hits. We use the following set of percentiles: {1%, 2%, ..., 10%, 15%, ..., 90%, 91%, ..., 99%}.

Selection of event time window
One can also see in Figure 3 that the signal hits arrive within the first few hundred nanoseconds, while the dark hits form quasi-constant pedestal. Therefore, a time window can be selected, based on the data, in a way to contain mainly signal events. For this purpose we trained Boosted Decision Trees model with different bounds of window: {75ns, 125ns, 175ns, 250ns, 500ns, 750ns, 1500ns} and always started from t = 0. It was trained on a 200k dataset and using all new features. Table 1 shows the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE) on the test dataset for different window bounds. The best one was found to be 500 ns, however all the windows longer than 175 ns showed similar performance.

Feature selection
Finally, we have a large set of features, but many of them are highly correlated, so we expect that a small set of them contain all the information and provide a performance close to the best possible one. Thus, the next task is to get a subset of the most informative features from the available set of all 78 features. For this purpose we use a dataset with 1M events. To select the most informative features, we use the following algorithm. First, we train a model on all features and computed RMSE on the validation dataset with 150k events. Then we take an empty list and start populating it with features. On each step we pick a feature which provide the best improvement of the model in terms of RMSE calculated on the validation dataset and put it to the end of the list. We continue while this RMSE value differs from the RMSE value for the model trained on all features by more than ε, chosen to be 0.0002. This procedure results to the following set of features (sorted by importance): nHits, ht_20p, jacob_cc, ht_2p, ht_35p, R_cc, ht_75p Not surprisingly, for the energy reconstruction, the most informative feature is the total number of hits nHits, because its strongly correlated with energy, but at the same time it is hard to interpret the order of the rest features. The subset of the selected features contains jacob_cc and R_cc, which bring the spatial information allowing to recover the non-uniformity of detector response. We checked simpler combinations of features for the center of charge position. It turned out that the combination of rho_cc and R_cc gives the same result as jacob_cc and R_cc, so we have chosen them as they are more intuitive. Our final set of features is: nHits, ht_20p, rho_cc, ht_2p, ht_35p, R_cc, ht_75p Figure 4 illustrates the selected percentiles of the CDFs for an event with energy of 7 MeV and for different radial positions. As one can see ht_2p contains information about the beginning of the event, that is about the moment when the number of photons PMT hits begins to grow sharply. The remaining percentiles contain information about the shape of the CDF curve and help us to separate one curve from another. The 75% percentile is close to the moment of "saturation" of the CDF curve.

Results
To evaluate the quality of the model, we use two metrics: resolution and bias. These metrics are obtained as a result of the Gaussian fit of E pred − E true distribution. The resolution is defined as σ/E true and the bias -as µ/E true , where σ and µ are the standard deviation and the mean of the Gaussian distribution respectively. The performance is shown dependent on the so called visible energy, i.e. the maximal energy that can be converted into light: E vis = E kin + 1.022 MeV. This procedure is described in more details in [4]. Figure 5a illustrates the results for BDT models trained on datasets that contain different amount of events: 100k, 1M, 5M. We obtained that 1M events can provide the best possible accuracy of the model, providing only a little improvement compared to 100k events. This illustrates that fast learning is one of the advantages of the BDT model: one can get an acceptable quality already on a relatively small number of events in the dataset. Figure 5b shows a comparison for the BDT model trained on the 5M dataset for different options: without TTS & DN, with TTS only, with DN only, with TTS & DN. One can see that DN worsens the resolution, TTS -almost does not.
A comparison of the BDT model with other more complex deep learning models (ResNet, VGG, and GNN) [4] is shown in Figure 6. All the models are trained on the dataset with 5M events. We can see that the performance of the energy reconstruction with BDT model is practically similar to the complex deep learning models. At the same time the computations required for training and prediction are much faster due to the minimalistic nature of BDT.

Summary
In this work we have presented the use of Boosted Decision Trees for energy reconstruction in the JUNO experiment in the relevant energy range. We have designed and investigated a large set of features and have selected a small subset providing the performance nearly equal the one obtained with the full set of features. Using such a minimalistic and fast model as BDT we achieved a performance similar to the one of more complex models like ResNet, VGG, GNN.