Optimization of the input space for deep learning data analysis in HEP

The deep learning neural network technique is one of the most efficient and general approaches to multivariate data analysis in collider experiments. An important step of such an analysis is the optimization of the input space for the multivariate technique. In this article we propose a general recipe for constructing the set of low-level observables sensitive to the differences between collider hard processes.


Introduction
Many high energy physics tasks in collider experiments require modern, efficient techniques to reach the desired sensitivity. A rather general scheme of HEP data analysis consists of distinguishing some rare physics process from overwhelming background processes. The neural network (NN) technique is one of the most popular and efficient multivariate methods for analyzing a multidimensional space of observables, and it helps to increase the sensitivity of the experiment. Possible optimizations of the set of high-level observables for multivariate analysis were considered previously, and a general recipe was formulated [1,2] based on the analysis of the Feynman diagrams which contribute to the signal and background processes. The novel approach of deep learning neural networks (DNN) is becoming more popular and is more efficient in some cases [3]. The main advantage of a DNN is its ability to operate with raw, low-level data without preprocessing and to recognize the necessary features during the training stage. Unfortunately, checks of a naive implementation of a DNN with low-level observables, such as the four-momenta of the final particles, do not demonstrate the desired efficiency. The aim of this article is to propose a general recipe for finding the set of low-level observables for DNN analysis of collider hard processes.

Check of the concept
Training of a NN usually means an approximation of some function. A classification task can be considered as an approximation of the multidimensional function which maps the input vectors of each class to the desired NN output for that class. What types of functions can be approximated in this manner? The question can be traced historically to the 13th mathematical problem formulated by David Hilbert [4]. It can be formulated in the following way: "can every continuous function of three variables be expressed as a composition of finitely many continuous functions of two variables?". The general answer was given by Andrey Kolmogorov and Vladimir Arnold [5] in the Kolmogorov-Arnold representation theorem: "every multivariate continuous function can be represented as a superposition of continuous functions of one variable". Based on this theorem, one can conclude that the methods developed for NN training can potentially approximate all continuous multivariate functions. In reality, if we consider the standard form of the perceptron, $y_i = \sigma\left(\sum_{j=1}^{n} w_{ij} x_j + \theta_i\right)$, the only nonlinear part is the activation function $\sigma()$. In the simplest case, taking the very popular ReLU activation function ($\sigma(x) = x$ for $x > 0$ and $\sigma(x) = 0$ for $x \le 0$), the whole NN with many layers and perceptrons is a piecewise-linear function of the input variables.

One needs to understand what type of functions we consider in high energy physics to describe the properties of collider hard processes, and how to describe the input space in the most general and efficient way. In collider physics one formulates these properties in terms of the four-momenta of the final particles. A seemingly reasonable way is to take all four-momenta of all final particles as the input vector for the DNN analysis, so that the DNN resolves the sensitive features during training. We can check this hypothesis and look for the general low-level observables for the DNN implementation.

For the DNN training we use the TensorFlow [7] and Keras [8] software. As the efficiency benchmark one can use the simpler, but still efficient, Bayesian neural networks (BNN) [9,10] with only one hidden layer. The set of high-level input variables for the BNN is very specific to the particular physics processes and is highly optimized based on the method mentioned above [2]. As an example of a physics task we consider distinguishing the t-channel single-top-quark production process from the top-quark pair production process. The task is not trivial, but has already been considered many times in the past [6]. The set of high-level input variables for the cross-check of the efficiency is the same as in the analysis of the CMS collaboration [11]. The Monte-Carlo simulation has been performed with the CompHEP package [12].

At the first step of the check we compare the efficiency of the BNN and the DNN with the same set of high-level input variables. The comparison is shown in Fig. 1. The left plot shows the output of the DNN and BNN for the signal and background processes. The right plot shows the ROC (Receiver Operating Characteristic) curve, which is commonly used to demonstrate the efficiency: the efficiency is higher if the Area Under the Curve (AUC) is larger.
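As an illustration of this setup, the following is a minimal sketch of a one-hidden-layer ReLU classifier in Keras, evaluated via a ROC curve and its AUC. The input dimension, layer width, training settings, and the placeholder random arrays are illustrative assumptions, not the actual configuration behind Fig. 1.

```python
# Minimal sketch of the efficiency check described above, assuming the
# high-level variables are already assembled into arrays X (events x variables)
# and y (1 = signal, 0 = background). All sizes here are illustrative.
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_curve, auc

def build_classifier(n_inputs, n_hidden=32):
    """One hidden ReLU layer; sigmoid output interpreted as signal probability."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_inputs,)),
        tf.keras.layers.Dense(n_hidden, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Placeholder data standing in for the simulated signal/background samples.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(10000, 12)), rng.integers(0, 2, 10000)
X_test,  y_test  = rng.normal(size=(5000, 12)),  rng.integers(0, 2, 5000)

model = build_classifier(n_inputs=X_train.shape[1])
model.fit(X_train, y_train, epochs=20, batch_size=256, verbose=0)

# ROC curve and AUC, the efficiency criterion used in Figs. 1-5.
scores = model.predict(X_test, verbose=0).ravel()
fpr, tpr, _ = roc_curve(y_test, scores)
print("AUC =", auc(fpr, tpr))
```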
In Fig. 1 one can see that the BNN and the one-hidden-layer DNN have the same efficiency for the same set of high-level variables. We can conclude that both methods provide the same sensitivity for the same input vector. However, the DNN method makes it possible to build very large networks with many hidden layers and to analyze raw low-level features. At the second step of the cross-check we compare the same benchmark BNN with a DNN trained on the naive set of low-level variables, the four-momenta of the final particles mentioned above. The corresponding comparison is shown in Fig. 2. In Fig. 2 one can see that the DNN trained with the complete set of four-momenta of the final particles provides a worse result than the benchmark BNN trained with the optimized high-level variables. Such behavior demonstrates that one needs a deeper understanding of the function which has to be approximated and of the raw input vector used to describe the input space for the DNN.
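For concreteness, the naive low-level input of this second check can be sketched as follows; the particle content, the fixed ordering, and the numerical values of the four-momenta are hypothetical.

```python
import numpy as np

def naive_low_level_vector(four_momenta):
    """Flatten the (E, px, py, pz) of all final-state particles, in a fixed
    order, into a single input vector, as in the naive DNN check of Fig. 2."""
    return np.concatenate([np.asarray(p, dtype=float) for p in four_momenta])

# Hypothetical reconstructed event: lepton, neutrino, b-jet, light jet (GeV).
event = [
    (55.1,  20.3, -41.2,  30.0),
    (38.7, -15.8,  28.9,  19.4),
    (120.5, 60.2, -10.1, 103.3),
    (80.2, -44.0,  12.7,  65.8),
]
x = naive_low_level_vector(event)   # 16-dimensional DNN input for this event
```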

Formulation of the general recipe to form the input space for DNN analysis of HEP scattering processes
The main properties of a collider hard process, e.g. differential cross sections, are proportional to the squared matrix element of the particular hard process. The concrete forms of the matrix elements differ, but in all cases they are functions of the scalar products of four-momenta, or equivalently of the Lorentz-invariant Mandelstam variables. For example [13], the squared matrix element for the simple s-channel single top production process $ud \to tb$, in terms of the scalar products of the four-momenta $(p_u, p_b, p_d, p_t)$, takes the form

$$|M|^2 \propto \frac{(p_u p_b)\,(p_d p_t)}{(\hat s - M_W^2)^2},$$

or it can be rewritten in terms of Mandelstam variables, using $(p_u p_b) = -\hat t/2$, $(p_d p_t) = (M_t^2 - \hat t)/2$, $(p_d p_b) = -\hat u/2$, $(p_u p_t) = (M_t^2 - \hat u)/2$, where $M_t$ is the top quark mass and $M_W$ is the W boson mass, as

$$|M|^2 \propto \frac{\hat t\,(\hat t - M_t^2)}{(\hat s - M_W^2)^2}.$$

From the textbooks one knows that for a $2 \to n$ scattering process there are $3n - 4$ independent components, so a minimal set of observables can form a complete basis. One would therefore suggest taking the scalar products of the four-momenta as the input space for the DNN analysis. The comparison of this approach with the benchmark BNN is shown in Fig. 3. The figure demonstrates a significantly worse efficiency of the DNN trained with the scalar products of four-momenta in comparison with the benchmark BNN. The reason is simple in this case: the matrix element depends on the scalar products of the four-momenta not only of the final particles, but also of the initial particles, which we cannot reconstruct at hadron colliders. For massless particles the $\hat t = (p_{\mathrm{final}} - p_{\mathrm{initial}})^2$ variables can be rewritten in the following form [2]:

$$\hat t_{i,f} = -\sqrt{\hat s}\; e^{Y} p_T^f\, e^{-|y_f|},$$

where $\hat s$ is the squared invariant mass of the final-state system, $Y = \frac{1}{2}\ln\!\left(\frac{p + p_z}{p - p_z}\right)$ is the pseudorapidity of the center of mass of the system, and $p_T^f$ and $y_f$ are the transverse momentum and pseudorapidity of the final particle. Therefore, we can add the four-momenta of the final particles, in addition to their scalar products, as a more complete basis describing the input space for the DNN analysis. The comparison is shown in Fig. 4, where the benchmark BNN is compared with a DNN trained on the set of scalar products and four-momenta of the final particles. Fig. 4 demonstrates very similar performance of the DNN trained with this rather general set of low-level variables and the benchmark BNN trained with the highly optimized set of high-level variables.

At the last step of our comparison we apply the DNN in the combined space of low- and high-level variables and check whether there is any improvement in the sensitivity. This comparison with the benchmark BNN is shown in Fig. 5. Fig. 5 demonstrates almost the same DNN performance as in Fig. 4, where the DNN was trained with the general set of low-level variables; the lack of significant improvement can be attributed to the only approximate reconstruction of the four-momenta of the initial particles, for which the optimized set of high-level variables adds some information beyond the general low-level set of variables.
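A minimal sketch of the proposed input-space construction, assuming each event supplies a fixed-order list of final-state four-momenta: it combines all pairwise Minkowski scalar products with the four-momentum components themselves, and also encodes the approximate $\hat t_{i,f}$ expression quoted above. The function names are illustrative.

```python
import numpy as np
from itertools import combinations

METRIC = np.diag([1.0, -1.0, -1.0, -1.0])   # (+,-,-,-) Minkowski metric

def minkowski_dot(p, q):
    """Lorentz-invariant scalar product (p q) = E_p E_q - p_vec . q_vec."""
    return float(p @ METRIC @ q)

def recipe_input_vector(four_momenta):
    """General low-level input proposed in the text: all pairwise scalar
    products of the final-state four-momenta plus the four-momenta themselves."""
    ps = [np.asarray(p, dtype=float) for p in four_momenta]
    dots = [minkowski_dot(p, q) for p, q in combinations(ps, 2)]
    return np.concatenate([np.asarray(dots), np.concatenate(ps)])

def t_hat_approx(s_hat, Y, pt_f, y_f):
    """Approximate t_{i,f} for massless initial partons, following the
    formula above: -sqrt(s_hat) * exp(Y) * pT_f * exp(-|y_f|)."""
    return -np.sqrt(s_hat) * np.exp(Y) * pt_f * np.exp(-abs(y_f))

# For a 4-particle final state this yields C(4,2) = 6 scalar products
# plus 16 momentum components, i.e. a 22-dimensional DNN input.
```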

Conclusion
One of the main reasons for the high efficiency of deep learning neural networks is the ability to operate with raw low-level information: the DNN technique resolves the necessary high-level features during training. The question is how to find a complete set of low-level observables to achieve optimal performance. This paper formulates a general recipe to construct the set of low-level observables for DNN analysis of collider hard processes. The simple recommendation is to take the four-momenta of the final particles together with the scalar products of those four-momenta. Such a combination provides the most general and efficient way to distinguish the properties of hard processes at hadron colliders. The scalar products can be replaced with Mandelstam variables. The main limitation of the recipe comes from the unknown four-momenta of the initial particles at hadron colliders, so some specific high-level variables can still improve the performance in particular cases. The efficiency of this general recipe has been demonstrated by comparison with the well investigated and highly optimized set of high-level variables constructed for the real data analysis in the CMS experiment [11].