Using machine learning to speed up new and upgrade detector studies: a calorimeter case

In this paper, we discuss how advanced machine learning techniques allow physicists to perform in-depth studies of the realistic operating modes of detectors already at the design stage. The proposed approach can be applied to both the conceptual design (CDR) and technical design (TDR) phases of future detectors, as well as to upgrades of existing detectors. Machine learning approaches may speed up the verification of possible detector configurations and automate the entire detector R\&D, which is often accompanied by a large number of scattered studies. We present the approach of using machine learning in the detector R\&D and optimisation cycle, with an emphasis on the electromagnetic calorimeter upgrade project for the LHCb detector\cite{lhcls3}. The spatial reconstruction and time-of-arrival performance of the electromagnetic calorimeter are demonstrated.


Introduction
Calorimeters are an essential part of most existing and future detectors in high energy physics. The high luminosity delivered by the collider causes high multiplicity and hit occupancy in the calorimeter. In order to operate under such conditions, the new generation of calorimeters is characterised by high granularity (an increased number of channels) and by the ability to measure the time of arrival of particles to mitigate pile-up.
To achieve the planned physics performance during the R&D of modern HEP experiments, detailed Geant4 simulation [2] of the calorimeter is necessary. Such simulations are computationally expensive, given the large number of channels and the variety of possible options in the calorimeter module technologies, in the module arrangement, in the reconstruction of the attributes of physical objects, etc. The optimisation cycle within the calorimeter R&D comprises several computationally intensive elements, such as shower development and particle transport. Multi-parametric optimisation is also expensive. These factors make new approaches to calorimeter development necessary. Machine learning allows a quick turnaround of the optimisation cycle when parameters are changed, and eliminates the manual work of re-tuning the simulation and reconstruction.
The challenges of calorimeter R&D push researchers to apply machine learning instead of parametric approaches. Recent efforts based on Generative Adversarial Networks for the simulation of calorimeter showers prove them to be good candidates for fast simulation in high energy particle physics [5][6][7]. The idea of applying machine learning (deep learning) to calorimeter reconstruction, as well as to shower simulation, is demonstrated in [4]. That paper, among other things, shows the applicability of a machine-learning-based reconstruction model to shower inputs from several detector geometries.

Spatial Reconstruction
To reconstruct the hit position of a particle reaching the calorimeter, we implement an approach based on Pythia8-generated events of B⁰ₛ → J/ψ(→ μ⁺μ⁻)π⁰(→ γγ) (hereinafter the signal sample), generated with default LHCb tunings, and on Geant4-simulated events in a simplified high-granularity detector. This simplified simulation setup uses the same alternating layers of scintillator and lead plates (Shashlik technology) as the LHCb Electromagnetic Calorimeter (ECAL) [3], and it consists of a matrix of 30×30 cells of size 20.2×20.2 mm² in the η–φ plane. This allows us to emulate each type of current ECAL module: inner, middle or outer, with cell sizes of 40.4×40.4 mm², 60.6×60.6 mm² or 121.2×121.2 mm², respectively.
For each photon from the signal sample, we find the closest track in the Geant4-simulated data. The calorimeter cell in which the signal produces the hit is required to be surrounded by two layers of cells of the same type; thus, a matrix of 5×5 cells of the same type is obtained. We suppose that most of the clusters of the signal sample do not exceed the size of such a matrix. Inside this matrix, the cell with the highest energy deposit is found. The barycentre of the cluster is the reconstructed position of the photon that released energy in the calorimeter. The dependence of each of the local coordinates of the signal cluster barycentre on the corresponding true coordinate of the hit position we call the S-curve, due to its distinctive shape. The S-curves of both the x and y coordinates for the inner region are displayed in Figure 1. The difference between the S-curve and a straight line characterises the quality of the spatial reconstruction. Several approaches were tested to calibrate the S-curve (hit position reconstruction): a parametric approach and a machine learning approach using the XGBoost regressor [9]. The calibration results for these approaches are shown in Figure 2 and in Table 1.
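The barycentre step above can be sketched as follows. This is a minimal illustration, not the analysis code: the cell energies and the convention that local coordinates are measured from the centre of the 5×5 matrix are assumptions.

```python
import numpy as np

def cluster_barycentre(energies, cell_size=20.2):
    """Energy-weighted barycentre of a 5x5 cell matrix.

    `energies` is a 5x5 array of energy deposits; local coordinates
    are measured in mm from the centre of the matrix (the cell with
    the highest energy deposit).
    """
    energies = np.asarray(energies, dtype=float)
    # Cell-centre coordinates relative to the central cell.
    coords = (np.arange(5) - 2) * cell_size
    total = energies.sum()
    x_bar = (energies.sum(axis=0) * coords).sum() / total  # column sums -> x
    y_bar = (energies.sum(axis=1) * coords).sum() / total  # row sums -> y
    return x_bar, y_bar
```

For a hypothetical cluster with all energy deposited one cell to the right of the centre, the barycentre is one cell size in x and zero in y.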
As a metric of spatial resolution we use the RMSE of the difference between the true and reconstructed local coordinate (independently for x and y) of the hit. The observed difference in the metric between the local coordinates x and y was found to be negligible. Therefore, all results in this metric are presented for the local coordinate x.
For the parametric approach, the reconstructed local coordinate x is represented as a · arcsinh(b · x). The calibration parameters were found using random search [10] with 1000² points in the range (0.01, 100) for each parameter. The best parameters obtained with the parametric approach for the inner section are a = 1.15, b = 2.07.
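Such a random-search calibration can be sketched as below. The data are synthetic (the S-curve relation is mocked by inverting the calibration form with the parameters quoted above), and the log-uniform sampling over the range is an illustrative choice, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def calibrate(x_true, x_bary, n_trials=1000, lo=0.01, hi=100.0):
    """Random search for (a, b) in x_reco = a * arcsinh(b * x_bary)."""
    best_a, best_b, best_rmse = None, None, np.inf
    for _ in range(n_trials):
        # Sample the two parameters log-uniformly over (lo, hi).
        a, b = 10 ** rng.uniform(np.log10(lo), np.log10(hi), size=2)
        rmse = np.sqrt(np.mean((x_true - a * np.arcsinh(b * x_bary)) ** 2))
        if rmse < best_rmse:
            best_a, best_b, best_rmse = a, b, rmse
    return best_a, best_b, best_rmse

# Toy data: invert the calibration form with the parameters quoted in the text.
x_true = np.linspace(-10, 10, 200)
x_bary = np.sinh(x_true / 1.15) / 2.07
a, b, rmse = calibrate(x_true, x_bary)
```

The search simply keeps the parameter pair with the lowest RMSE between the true and calibrated coordinates.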
The selected machine learning approach was based on extreme gradient boosting (XGBoost); among its hyperparameters, colsample_bytree, gamma, max_depth and min_child_weight were selected, which are typical for such a problem. These hyperparameters were optimised using BayesSearchCV within the ranges (0.3, 0.7), (0.1, 0.9), (1, 20) and (1, 10), respectively. The chosen parameters colsample_bytree = 0.7, gamma = 0.1, max_depth = 20, min_child_weight = 10 (the set of values is for the inner section) provide the best results, with no overtraining observed. We used 5-fold cross-validation and trained the regressor on 30% of the events of the sample.
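A hyperparameter search of this kind can be sketched as follows. To keep the sketch self-contained it uses scikit-learn's GradientBoostingRegressor and RandomizedSearchCV as stand-ins for XGBoost and BayesSearchCV; the data are synthetic, and the mapping of XGBoost parameter names to their rough scikit-learn analogues (given in comments) is an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for (cluster features -> true local coordinate x).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] + 0.1 * rng.normal(size=500)

# Search spaces mirroring the ranges quoted in the text
# (XGBoost parameter names given in the comments).
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={
        "max_features": [0.3, 0.4, 0.5, 0.6, 0.7],  # ~ colsample_bytree in (0.3, 0.7)
        "max_depth": list(range(1, 21)),            # max_depth in (1, 20)
        "min_samples_leaf": list(range(1, 11)),     # ~ min_child_weight in (1, 10)
    },
    n_iter=10,
    cv=5,  # 5-fold cross-validation, as in the text
    random_state=0,
)
search.fit(X, y)
```

Replacing the random search with BayesSearchCV from scikit-optimize keeps the same interface while sampling the space with a Bayesian surrogate model.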
Time of arrival

The procedure of test beam data analysis starts with the choice of an initial sampling point. Afterwards, given a downsampling factor k, every k-th point was sampled, moving towards both lower and higher frequencies. The reference time was predicted using 5 different models. All the regressors were tuned using Bayesian optimisation. Feature engineering for the selected regressors did not show any significant improvement in the results. The RMSE of the time difference was chosen as a loss function. The comparison of the 5 selected models is shown in Figure 4. Different regressors demonstrate similar results. One can see that the sampling frequency can be reduced from 5 GHz to 250 MHz without considerable degradation of the time reconstruction. Subsequent results were obtained using the XGBoost regressor [9].
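The downsampling step can be sketched as below. The waveform here is a synthetic Gaussian pulse, not test beam data; the 5 GHz to 250 MHz decimation factor of 20 follows the frequencies quoted in the text.

```python
import numpy as np

def downsample(waveform, base_freq_hz, target_freq_hz, start=0):
    """Keep every k-th sample of `waveform`, beginning at index `start`."""
    k = int(round(base_freq_hz / target_freq_hz))
    return waveform[start::k], base_freq_hz / k

# A toy Gaussian pulse sampled at 5 GHz, decimated to 250 MHz (factor 20).
t = np.arange(1000) / 5e9                     # 1000 samples, 200 ns window
pulse = np.exp(-((t - 50e-9) / 10e-9) ** 2)
coarse, freq = downsample(pulse, 5e9, 250e6)
```

Varying `start` corresponds to the choice of the initial sampling point mentioned above, since different offsets select different subsets of the original samples.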

Overlapping signals
High pile-up conditions imply the overlap of signals from different vertices. Timing information can be used to mitigate pile-up. Hereinafter, we consider signal processing at the readout of an individual calorimeter cell.
Since we did not have samples with multiple signals, we generated them under a given amplitude ratio and time shift. The resulting signal can be parameterised as S(t) = S₁(t) + α · S₂(t − τ), where S₁ and S₂ are different signals obtained from the test beam data in the same way as described in Section 3.1. The reference time of the signal whose amplitude is greater is assigned as the reference time of the resulting signal. Figure 5 (left) shows the generated signal. By aggregating the arrival time difference (τ) and the ratio (α) of two signals, one can produce 2-dimensional distributions over these variables, as displayed in Figure 5 (right). It demonstrates the accuracy of our predictions of these two parameters for two different datasets.

Reference time prediction in the presence of the second signal
The final model for the time difference/amplitude separation of two signals employs an ensemble of KNeighborsClassifier, DecisionTreeClassifier and RandomForestClassifier models. The models were tuned using Bayesian optimisation separately for each sampling frequency.
We tried a variety of strategies for probabilistic class estimation, such as 'linear', 'harmonic', 'geometric' and 'rank averaging'. 'Harmonic' turned out to be the best one according to cross-validation.
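The 'harmonic' averaging of per-model class probabilities can be sketched as below. The probability vectors are made-up numbers standing in for the outputs of the three classifiers; the renormalisation and the zero-guard epsilon are implementation choices, not taken from the paper.

```python
import numpy as np

def harmonic_average(prob_list):
    """Harmonic mean of class-probability vectors from several models.

    `prob_list` is a list of arrays of shape (n_samples, n_classes);
    the result is renormalised so each row sums to 1.
    """
    probs = np.stack(prob_list)                  # (n_models, n_samples, n_classes)
    eps = 1e-12                                  # guard against zero probabilities
    hmean = probs.shape[0] / np.sum(1.0 / (probs + eps), axis=0)
    return hmean / hmean.sum(axis=1, keepdims=True)

# Three hypothetical classifiers voting on one sample with two classes.
p = harmonic_average([np.array([[0.9, 0.1]]),
                      np.array([[0.8, 0.2]]),
                      np.array([[0.6, 0.4]])])
```

Compared with the linear (arithmetic) mean, the harmonic mean is pulled towards the lowest of the per-model probabilities, so a class only scores highly when all models agree.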

Conclusion
The proposed approaches to the spatial reconstruction and time-of-arrival properties of the LHCb electromagnetic calorimeter illustrate how machine learning applied to detector development eliminates the manual selection of reconstruction parameters while the main detector quantities are varied. Thus, the detector R&D and its optimisation cycle as a whole are sped up. The spatial reconstruction was obtained using both parametric and machine learning approaches. The performance of the selected XGBoost configuration surpasses that of the parametric approach for each of the calorimeter regions. For pile-up mitigation, the machine learning approach demonstrates the ability to perform time reconstruction in the presence of a second signal.

Figure 1. The dependence (2-dimensional distribution) of the local coordinates of the signal cluster barycentre on the true coordinates of the hit position for inner modules: local coordinate x (left) and y (right). Here and in Figure 2, colour from violet (dark) to yellow (bright) represents the normalised event counts from 0.0 to 1.0, respectively.

Figure 2. S-curve calibration results for inner modules and local coordinate x, using the parametric approach (left) and the XGBoost regressor (right).

Figure 5. The generated and initial signals (left). The 2-dimensional distribution of the arrival time difference and signal ratio of two signals for a sampling frequency of 1000 MHz (right).

Figure 6 demonstrates the ability of the final model to evaluate the time reconstruction in the presence of the second signal.

Figure 6. The 2-dimensional distribution of the arrival time difference and signal ratio of two signals for a sampling frequency of 1000 MHz.