Machine learning techniques for jet flavour identification at CMS

Jet flavour identification is a fundamental component of the physics program of the LHC-based experiments. The presence of multiple flavours to be identified leads to a multiclass classification problem. We present results from a realistic simulation of the CMS detector, one of the two multi-purpose detectors at the LHC, and the corresponding performance measured on data. Our tagger, named DeepJet, relies heavily on applying convolutions to low-level physics objects, such as individual particles. This approach allows an unprecedented amount of information to be used with respect to what is found in the literature. DeepJet stands out as the first proposal that can be applied to multi-classification for all jet flavours. We demonstrate significant improvements from the new approach in the classification capabilities of the CMS experiment in simulation in several of the tested classes. At high momentum, reductions of nearly 90% in the number of false positives at a standard operating point are achieved.


Introduction
Jet flavour identification has been a longstanding staple of the physics program of HEP experiments. The problem has always been approached as a supervised classification problem, leveraging the peculiar features of heavy-flavour hadrons embedded inside the parton shower. Heavy-flavour hadrons, which contain a charm or beauty quark, have a significant lifetime and produce displaced tracks and secondary vertices (SV) within the clustered jet. Additionally, they have significant semi-leptonic and leptonic branching fractions. It is therefore possible to use this information in the classification as well, even though this is rarely done in practice within the CMS collaboration, since such information is used to obtain a data sample enriched in heavy-flavour content with which to measure the tagger efficiency in real data.

AK4 Jets
The current default jet flavour classifier in CMS is DeepCSV. DeepCSV was first introduced in Ref. [1] and consists of a dense deep neural network taking as input 8 features for each of the six most displaced tracks in the jet, 8 features from the most displaced secondary vertex, and 12 global variables, totalling 68 input features. Missing features are zero-padded. Tracks and SVs undergo a preselection before the feature extraction to reject fake and pile-up tracks and nuclear-interaction vertices. The 68 input features are then fed into five layers with 100 nodes each with ReLU activation, and an output layer with SoftMax activation discriminating between four output classes: b, bb (two B hadrons in the jet), c, and light (comprising both quarks and gluons). The model has been trained with the Keras [2] python package using the TensorFlow [3] backend. DeepCSV outperforms all the previous CMS taggers, including cMVAv2, which uses the additional lepton information, as shown in Figure 1.
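The DeepCSV topology described above is simple enough to sketch in a few lines of tf.keras. This is an illustrative reconstruction only, assuming the stated layout (68 zero-padded inputs, five dense layers of 100 ReLU nodes, a 4-way SoftMax output); layer names and the framework version are not those of the actual CMS training.

```python
# Minimal sketch of the DeepCSV-style architecture described in the text.
# Input: 68 zero-padded jet features; output: P(b), P(bb), P(c), P(light).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_deepcsv_sketch():
    inputs = layers.Input(shape=(68,), name="jet_features")
    x = inputs
    for i in range(5):  # five hidden layers, 100 ReLU nodes each
        x = layers.Dense(100, activation="relu", name=f"hidden_{i}")(x)
    # 4-way SoftMax: b, bb, c, light
    outputs = layers.Dense(4, activation="softmax", name="flavour_probs")(x)
    return models.Model(inputs, outputs)

model = build_deepcsv_sketch()
```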
The large performance gain obtained by moving to a deep neural network sparked interest in more complex models. Convolutional neural networks have been used successfully for image classification, and there have been several attempts to apply such an approach to jet classification. While this approach can be successful for boosted-object tagging, which focuses mainly on the internal energy distribution of the jet, flavour identification relies on additional features and on quantities that cannot easily be summed in a discretised image.
The DeepJet algorithm [4,5] focuses on single particles rather than on images. No preselection is applied to any track, secondary vertex, or neutral candidate before entering the network. The network uses 16 features for each of up to 25 input tracks (sorted by displacement), 8 features for each of up to 25 neutral candidates, 12 features for each of up to 4 secondary vertices, and 6 global variables. Each candidate type is passed through a set of convolutional layers that operate on each candidate separately. These layers provide an automated form of feature selection and engineering, resulting in 8, 4, and 8 features for each input track, neutral candidate, and secondary vertex, respectively. Their outputs are then masked for missing inputs and passed to three independent LSTMs, which learn a compact summary of each candidate type. The output dimensionality of these recurrent layers is 150, 50, and 50 for tracks, neutral candidates, and secondary vertices, respectively. Finally, these outputs are combined with the global variables and fed into seven dense layers with 100 nodes each, except for the first layer, which has 200 nodes. A final output layer provides discrimination between six classes: three b jet types (one B hadron, two B hadrons, one leptonically decaying B hadron), charm jets, light quark jets, and gluon jets. Each node in the network has a ReLU activation function, except for the output layer, which has a SoftMax activation. A schematic of the DeepJet network structure can be found in Figure 2.
DeepJet has shown improved performance in b jet classification, especially at high jet pT, as shown in Figure 3. The multi-classification approach of DeepJet also allows the same model to be compared with dedicated binary classifiers for quark/gluon discrimination, showing comparable performance, as shown in Figure 4.
Figure 4. Performance of the DeepJet multi-classification algorithm compared to two dedicated binary approaches, one exploiting the jet energy deposits in image format (convolutional) and one exploiting the kinematics of the single jet constituents in a list (recurrent). Shown is the probability for gluon jets to be misidentified as light quark (uds) jets, as a function of the efficiency to correctly identify light quark jets. The curves are obtained on simulated QCD events with p̂T between 30 and 50 GeV, using jets with pT above 30 GeV. The absolute performance serves as an illustration, since the light quark jet identification efficiency depends on the pT and η distributions of the jets, the event topology, the flavour composition of the sample, and the generator used. All curves are obtained using Pythia8. Jets that originate from a gluon splitting to cc or bb are not considered gluon jets.

Boosted resonances, AK8 Jets
An architecture similar to that of DeepJet is employed by the DeepDoubleB and DeepDoubleC taggers [6], although neutral candidates are ignored and the number of tracks and SVs is limited by a preselection. These classifiers aim at identifying the decay of a boosted resonance into a pair of b or c quarks, respectively. As previously mentioned, the network structure, summarised in Figure 5, is very similar to the one of DeepJet, but in this case GRUs are used as the recurrent units instead of LSTMs. The DeepDoubleB/C taggers are trained as binary taggers aiming at rejecting the QCD background (DeepDoubleBvL and DeepDoubleCvL) and at separating boosted double-b decays from double-charm jets (DeepDoubleCvB). The performance of these new taggers is shown in Figures 6 and 7. DeepDoubleB significantly outperforms the previous double-b classifier, but it also introduces significant mass sculpting, as shown in Figure 6, which is undesirable for physics analyses. To overcome this issue, two penalty terms, proportional to the Kullback-Leibler divergence between the original signal and background mass distributions and the corresponding classifier output-weighted distributions, are applied per batch. These terms ensure that the classifier output is decorrelated from the jet mass. The mass decorrelation comes at negligible cost in classification performance, as shown in Figure 8.
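The idea behind the penalty can be illustrated with a toy NumPy computation: the Kullback-Leibler divergence between the jet mass histogram and the same histogram weighted by the classifier output. A mass-decorrelated classifier yields a near-zero penalty, a mass-sculpting one a large penalty. The binning, mass range, and toy inputs below are arbitrary assumptions for illustration; the actual training applies such terms per batch inside the loss.

```python
# Toy illustration of a KL-divergence mass-decorrelation penalty.
import numpy as np

def kl_mass_penalty(mass, scores, bins=20, rng=(40.0, 200.0), eps=1e-8):
    """KL(p || q): p = jet mass histogram, q = same histogram weighted
    by the classifier score. Small when the score is mass-independent."""
    p, _ = np.histogram(mass, bins=bins, range=rng)
    q, _ = np.histogram(mass, bins=bins, range=rng, weights=scores)
    p = p / (p.sum() + eps) + eps
    q = q / (q.sum() + eps) + eps
    return float(np.sum(p * np.log(p / q)))

gen = np.random.default_rng(0)
mass = gen.uniform(40.0, 200.0, size=10000)
flat_scores = np.full(10000, 0.5)        # mass-independent classifier output
sculpt_scores = (mass - 40.0) / 160.0    # strongly mass-dependent output
```

With the flat scores the weighted histogram is proportional to the unweighted one, so the penalty vanishes; the sculpting scores distort the mass shape and are penalised.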
The DeepAK8 [7] tagger applies no preselection to the jet constituents, and up to a hundred of them are used. The large number of candidates makes the training of recurrent units computationally unfeasible; therefore, a set of convolutional kernels spanning multiple candidates is used instead. Ten features for each charged and neutral candidate are passed to one of these convolutional blocks, ordered in candidate pT, to learn the jet substructure. The flavour content of the jet is learned by two further such blocks, one using only charged jet constituents, sorted by displacement, and one using secondary vertices, sorted by flight distance. These three convolutional branches are then merged into a single dense layer with 521 nodes before reaching the output layer. A schematic of the DeepAK8 architecture can be found in Figure 9. DeepAK8 aims at classifying a wide variety of resonances in multiple decay modes, and it outperforms a simpler BDT approach in the task of top classification, as shown in Figure 10.
Figure 10. Boosted top classification performance of the DeepAK8 algorithm, compared in terms of ROC curves in simulated events, with top jets as signal and QCD jets as background, to a BDT and a deep neural network based only on kinematic information of the jet constituents and to a BDT based on kinematic information and displacement of the jet constituents. The events correspond to AK8 jets with 1000 < pT < 1400 GeV and |η| < 1.5.
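A minimal sketch of this three-branch convolutional layout is given below. Only the ingredients stated in the text are taken as given (up to ~100 candidates with 10 features, convolutions spanning multiple candidates, three branches merged into a 521-node dense layer); the kernel sizes, branch depths, charged-candidate and SV multiplicities, SV feature count, and number of output classes are all illustrative assumptions, and the real network is a considerably deeper ResNet-style CNN.

```python
# Rough sketch of a DeepAK8-like three-branch 1D CNN for boosted jets.
import tensorflow as tf
from tensorflow.keras import layers, models

def cnn_branch(seq_len, n_feat, name):
    """Convolutions spanning neighbouring candidates, pooled to one vector."""
    inp = layers.Input(shape=(seq_len, n_feat), name=f"{name}_in")
    x = layers.Conv1D(32, kernel_size=3, padding="same", activation="relu")(inp)
    x = layers.Conv1D(32, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling1D()(x)
    return inp, x

all_in, all_x = cnn_branch(100, 10, "all_cands")   # all candidates, pT-sorted
chg_in, chg_x = cnn_branch(60, 10, "charged")      # charged, displacement-sorted
sv_in, sv_x = cnn_branch(5, 10, "vertices")        # SVs, flight-distance-sorted

x = layers.Concatenate()([all_x, chg_x, sv_x])
x = layers.Dense(521, activation="relu")(x)        # merge layer from the text
out = layers.Dense(10, activation="softmax", name="class_probs")(x)

model = models.Model([all_in, chg_in, sv_in], out)
```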

Model Deployment in Production Environment
The deployment of these new deep-learning-based models in the production environment of a large HEP experiment poses a significant challenge. The training environment is usually python-based, with a custom set of libraries installed, and runs with minimal memory or CPU usage restrictions. The production environment, instead, is usually based on a custom C++ framework, with tight constraints on memory and running on multiple threads. The harmonisation of the thread-pool implementations in TensorFlow and in CMSSW and the optimisation of the model for inference have been the two major issues the collaboration had to overcome before fully integrating a TensorFlow inference engine into our production workflow. Further reduction of the memory footprint and of the number of external dependencies may come from model pruning or AOT compilation of the trained model.
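The thread-pool concern can be illustrated from the Python side: when inference runs inside a framework that already manages its own threads, TensorFlow is typically pinned to single-threaded execution so that it does not spawn competing pools. This is only a sketch using the public `tf.config.threading` API of recent TensorFlow releases; the actual CMSSW integration is done in C++ and involved harmonising the two thread-pool implementations, not merely limiting thread counts.

```python
# Pin TensorFlow's internal thread pools before any TF computation runs,
# so the host framework retains control of scheduling.
import tensorflow as tf

tf.config.threading.set_intra_op_parallelism_threads(1)  # within one op
tf.config.threading.set_inter_op_parallelism_threads(1)  # across ops
```

Note that these settings must be applied before TensorFlow initialises its runtime; afterwards they can no longer be changed.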

Conclusions
The latest developments in heavy-flavour classification from the CMS Collaboration have been reviewed in this contribution. Switching to a deep learning approach has brought significant improvements in this field, together with new challenges in deploying these models in the production environment of a large HEP experiment.