Machine Learning Techniques in the CMS Search for Higgs Decays to Dimuons

Abstract. With the accumulation of large collision datasets at a center-of-mass energy of 13 TeV, the LHC experiments can search for rare processes, where the extraction of signal events from the copious Standard Model backgrounds poses an enormous challenge. Multivariate techniques promise to achieve the best sensitivities by isolating events with higher signal-to-background ratios. Using the search for Higgs bosons decaying to two muons in the CMS experiment as an example, we describe the use of Boosted Decision Trees coupled with automated categorization for optimal event classification, bringing an increase in sensitivity equivalent to 50% more data.


Physics Motivation
Since the discovery of the Higgs boson in 2012 [1,2], focus has turned towards precisely measuring its properties. The Higgs couplings to the W and Z gauge bosons, as well as the third generation quarks (top and bottom) and tau leptons, have been observed by CMS and ATLAS, and appear to be consistent with the Standard Model (SM) predictions. Finding Higgs decays to a pair of muons with opposite charge (µµ) would provide the first evidence of Higgs couplings to fermions outside the third generation.
This search is made difficult by the minuscule branching fraction of Higgs decays to µµ, predicted to be ∼0.022% by the SM. There is also a large irreducible background from Drell-Yan µµ production, along with backgrounds from top quark or W boson pairs decaying to muons. The Higgs signal has a distinctive dimuon invariant mass peak near 125 GeV, which is only ∼4 GeV wide, thanks to the excellent muon momentum measurement in CMS. Meanwhile, the invariant mass spectrum for background events falls smoothly in the search region from 110 to 150 GeV. The CMS detector is described in detail in [3]. The results of the 2016 analysis, using 35.9 fb⁻¹ of collision data, were published by the CMS collaboration in [4]. In this paper, we describe in more detail how the analysis was optimized for maximum signal sensitivity, utilizing multivariate and machine learning techniques. Drell-Yan decays to muon pairs make up more than 90% of the background, and most closely resemble gluon-fusion signal events, with no additional energetic particles emerging from the collision. The tt̄ background and rarer processes with high-momentum jets must be distinguished from vector boson fusion (VBF) signal events.

Signal Extraction
The amount of Higgs signal in the data is measured by performing a combined signal-plus-background fit to the dimuon invariant mass spectrum in data, using a sum of three Gaussians to model the sharply peaked signal, and a modified Breit-Wigner curve to model the smoothly falling background, as shown in figures 2 and 3. However, with an initial signal-to-background (S/B) ratio of ∼0.3% even at the mass point of 125 GeV, an immense amount of data is needed to confirm the presence of signal events. The CMS analysis of 7 and 8 TeV data collected in 2011 and 2012 [5] divided events into categories based on the transverse momentum (pT) of the dimuon pair (which is higher for gluon-fusion signal than for Drell-Yan background), or the presence of a high-invariant-mass dijet pair, characteristic of VBF signal events. It also sub-divided the gluon-fusion categories based on the muon pseudorapidity (η), as central muons have better pT resolution, resulting in a sharper signal mass peak. Performing separate signal-plus-background fits in all of these categories and combining the results significantly increased the search sensitivity relative to a measurement of all candidate dimuon events together.
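The signal shape used in the fit, a sum of three Gaussians, can be illustrated with a minimal sketch. The component fractions, means, and widths below are hypothetical values chosen only to give a peak roughly 4 GeV wide; they are not the fitted CMS parameters, and the modified Breit-Wigner background is not modeled here because its functional form is not given in this paper.

```python
import math

def gaussian(x, mean, sigma):
    """Normalized Gaussian density."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def triple_gaussian(x, components):
    """Sum of Gaussians; components is a list of (fraction, mean, sigma)."""
    return sum(frac * gaussian(x, mean, sigma) for frac, mean, sigma in components)

def fwhm(pdf, lo, hi, steps=8000):
    """Full width at half maximum of a single-peaked density, found on a grid."""
    xs = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    ys = [pdf(x) for x in xs]
    half = max(ys) / 2.0
    above = [x for x, y in zip(xs, ys) if y >= half]
    return above[-1] - above[0]

# Hypothetical signal parameters (assumption, for illustration only):
# a narrow core plus two wider components for the resolution tails.
SIGNAL = [(0.70, 125.0, 1.5), (0.25, 124.6, 3.0), (0.05, 123.5, 6.0)]

# Width of the toy signal peak inside the 110-150 GeV search window.
width = fwhm(lambda x: triple_gaussian(x, SIGNAL), 110.0, 150.0)
```

The extra Gaussians widen the peak beyond the 2.355σ expected for the core alone, which is why a single Gaussian is usually not enough to describe the resolution tails.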

Event Classification
For the analysis of 13 TeV data collected by CMS in 2016, we developed a new event classification based on a larger set of input variables fed into a Boosted Decision Tree (BDT), implemented in the TMVA toolkit [6] of the ROOT analysis package [7]. A binary signal-background separation is computed, yielding a BDT score between -1 and 1, where events close to 1 are more signal-like, and events close to -1 are more background-like.
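To make the idea of a boosted classifier with a score between -1 and 1 concrete, here is a minimal sketch of gradient boosting with one-cut trees ("stumps") in plain Python. It is not the TMVA implementation used in the analysis: the toy features (a pT-like and an |η|-like variable), their distributions, the number of trees, the shrinkage, and the tanh mapping of the raw output onto (-1, 1) are all assumptions made for illustration.

```python
import math
import random

def fit_stump(xs, targets, n_cuts=15):
    """Find the single-variable cut whose two leaf means best fit the targets."""
    best = None
    for j in range(len(xs[0])):
        values = sorted(x[j] for x in xs)
        # Candidate cuts at evenly spaced quantiles of feature j.
        for k in range(1, n_cuts + 1):
            cut = values[k * len(values) // (n_cuts + 1)]
            left = [t for x, t in zip(xs, targets) if x[j] < cut]
            right = [t for x, t in zip(xs, targets) if x[j] >= cut]
            if not left or not right:
                continue
            mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
            err = sum((t - mean_l) ** 2 for t in left) + sum((t - mean_r) ** 2 for t in right)
            if best is None or err < best[0]:
                best = (err, j, cut, mean_l, mean_r)
    return best[1:]

def train_bdt(xs, labels, n_trees=30, shrinkage=0.3):
    """Labels are +1 (signal) or -1 (background); returns a list of stumps."""
    raw = [0.0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # Negative gradient of the loss ln(1 + exp(-2 * y * F)) with respect to F.
        grads = [2.0 * y / (1.0 + math.exp(2.0 * y * f)) for y, f in zip(labels, raw)]
        j, cut, mean_l, mean_r = fit_stump(xs, grads)
        stump = (j, cut, shrinkage * mean_l, shrinkage * mean_r)
        stumps.append(stump)
        raw = [f + (stump[2] if x[j] < cut else stump[3]) for x, f in zip(xs, raw)]
    return stumps

def bdt_score(stumps, x):
    """Sum the leaf values of all stumps and squash the result onto (-1, 1)."""
    raw = sum(left if x[j] < cut else right for j, cut, left, right in stumps)
    return math.tanh(raw)

# Toy events (assumption): signal has higher "pT" and is more central in "|eta|".
rng = random.Random(1)
signal = [(abs(rng.gauss(60.0, 25.0)), abs(rng.gauss(0.0, 1.0))) for _ in range(150)]
background = [(abs(rng.gauss(30.0, 25.0)), abs(rng.gauss(0.0, 1.6))) for _ in range(150)]

stumps = train_bdt(signal + background, [1] * 150 + [-1] * 150)
signal_scores = [bdt_score(stumps, x) for x in signal]
background_scores = [bdt_score(stumps, x) for x in background]
```

On average the signal events end up with higher scores than the background events, which is the property the categorization later exploits.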
The signal training set includes the three main production channels: gluon-fusion, VBF, and VH. Variables with some discriminating power for signal-background separation are used, but in all cases the difference between the inclusive signal and background shapes is relatively small. The invariant mass of the dimuon system is excluded, as it will be used independently to measure the amount of signal in each BDT-score-defined category. The kinematic variables selected as input to the BDT are as follows:
• The pT and η of the dimuon system
• The number of jets identified as coming from b-hadrons [8]
• The missing transverse energy ET.
The dimuon pT and η are most important, as gluon-fusion signal pairs tend to be more central and have higher pT than the Drell-Yan background. The dijet separation in η and invariant mass are crucial for clearly identifying the small fraction of VBF signal events.
Events with b-tagged jets and significant missing ET most likely come from tt̄ background events and are assigned a low score. The training is based on one million simulated events for the various channels, fully reconstructed in the CMS detector. The signal sample is split into three independent sets: one for training, a second for testing, and a third, kept completely independent to avoid any bias, for the final measurement. The background samples are typically split into 75% for training and 25% for testing. The final measurement does not use simulated background, but rather a direct analytic fit to the data in each category.
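The sample splitting described above can be sketched as follows. The helper name, the event counts, the random seed, and the equal-thirds split of the signal sample are assumptions for illustration; the text only states that the signal is split three ways and the background 75%/25%.

```python
import random

def split_indices(n_events, fractions, seed=0):
    """Shuffle event indices and cut them into consecutive, disjoint subsets."""
    indices = list(range(n_events))
    random.Random(seed).shuffle(indices)
    subsets, start = [], 0
    for k, frac in enumerate(fractions):
        # The last subset takes whatever remains, so no event is dropped.
        last = k == len(fractions) - 1
        stop = n_events if last else start + int(round(frac * n_events))
        subsets.append(indices[start:stop])
        start = stop
    return subsets

# Signal: training, testing, and an untouched set for the final measurement
# (equal thirds is an assumption).
sig_train, sig_test, sig_final = split_indices(9000, [1 / 3, 1 / 3, 1 / 3])

# Background: 75% for training and 25% for testing, as stated in the text.
bkg_train, bkg_test = split_indices(20000, [0.75, 0.25])
```

Keeping the third signal set out of both training and testing means the signal efficiencies used in the final measurement are not biased by any tuning done on the first two sets.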
The BDT uses 400 trees, gradient boost, and variable splitting at 1000. The area under the receiver operating characteristic (ROC) curve for signal-background separation is 0.72. The BDT response, transformed into quantiles with a uniform distribution in the sum of expected signal events from all production modes, is shown in figure 4. The VBF signal events have the highest BDT scores, with the best signal-to-background ratio. Gluon-fusion events have generally higher scores than the dominant Drell-Yan background, and tt̄ events congregate at the lowest end of the spectrum.
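The quantile transformation mentioned above maps each raw BDT score to the fraction of expected signal with a score at or below it, so that the total signal is uniformly distributed in the transformed variable. A minimal sketch, with a hypothetical function name and toy inputs, might look like this:

```python
import bisect

def make_signal_quantile_transform(signal_scores, signal_weights=None):
    """Return a function mapping a raw score to the weighted signal quantile."""
    if signal_weights is None:
        signal_weights = [1.0] * len(signal_scores)
    pairs = sorted(zip(signal_scores, signal_weights))
    sorted_scores = [score for score, _ in pairs]
    # Running sum of signal weights in order of increasing score.
    cumulative, total = [], 0.0
    for _, weight in pairs:
        total += weight
        cumulative.append(total)

    def transform(score):
        # Number of signal events with a score <= the given score.
        k = bisect.bisect_right(sorted_scores, score)
        return cumulative[k - 1] / total if k > 0 else 0.0

    return transform

# Toy signal scores (assumption): four equally weighted events.
to_quantile = make_signal_quantile_transform([-0.5, 0.0, 0.2, 0.9])
transformed = [to_quantile(s) for s in (-0.5, 0.0, 0.2, 0.9)]
```

The same function is then applied to background and data events, so any region where the background is depleted relative to the flat signal stands out directly as a region of higher signal-to-background ratio.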
Because the final signal measurement is made using a fit to the dimuon invariant mass spectrum, it is important that high BDT scores are not correlated to signal-like mass values. If such a correlation existed, real background events in the highest BDT bins would be biased towards 125 GeV, and could mimic an excess of signal events even if no true signal events were present. To confirm that such a bias does not exist, we evaluate the BDT on simulated signal events generated with Higgs boson masses of 120, 125, and 130 GeV. As the BDT output distribution looks identical for these three samples, we confirm that the BDT has not "learned" the true signal mass, at least not to a resolution finer than 10 GeV, a range much larger than the 4 GeV dimuon mass resolution at 125 GeV.
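One way to put a number on "looks identical" is the two-sample Kolmogorov-Smirnov distance, the largest gap between the cumulative score distributions of two samples. This statistic is a suggestion and not necessarily what was used in the analysis; the toy scores below are drawn from the same distribution for all three mass points purely to show the expected behavior when no mass dependence is present.

```python
import math
import random

def ks_distance(sample_a, sample_b):
    """Maximum absolute difference between two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    distance = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        distance = max(distance, abs(i / len(a) - j / len(b)))
    return distance

# Toy BDT scores for three Higgs mass hypotheses (assumption: same distribution).
rng = random.Random(3)
scores = {mass: [math.tanh(rng.gauss(0.3, 0.5)) for _ in range(2000)]
          for mass in (120, 125, 130)}

# Compare the 120 and 130 GeV samples with the nominal 125 GeV sample.
distances = {mass: ks_distance(scores[mass], scores[125]) for mass in (120, 130)}
```

A distance compatible with statistical fluctuations supports the conclusion that the classifier output does not depend on the generated mass; a large distance would signal that the BDT had picked up mass information indirectly.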

Decision Tree Auto-Categorizer
Once the BDT scores for each event are obtained, further improvement in the signal-to-background separation is possible by taking into account the fact that more central events in the CMS barrel have a better mass resolution than forward events where at least one of the muons is in one of the two CMS endcaps. The basic idea is to "greedily" optimize the sensitivity by simultaneously categorizing events based on the BDT score (from -1 to 1) and the maximum muon |η|, which is directly correlated to the mass resolution: 2.8 to 7.6 GeV full width at half-maximum (FWHM) for |η| from 0 to 2.4.
To compute the expected measurement sensitivity, the simulated signal (S) and background (B) events are divided into 0.5 GeV bins in µµ mass for the region from 120 to 130 GeV. The expected signal significance from each bin is given by S/√B, and the values from each bin are added in quadrature to give the total expected significance Z:
$$Z = \sqrt{\sum_{C} \sum_{i} \frac{S_{C,i}^{2}}{B_{C,i}}}$$
where the sum runs over the categories C and the mass bins i.
The automated categorization procedure is performed in steps. We start with one (inclusive) category C0 with fine mass binning, and check all possible binary cut values on muon |η| or event BDT score to find the value giving the maximum gain in expected significance for a split C0→C1+C2. In following iterations, we repeat the procedure on the new set of categories, "greedily" going for the maximal gains by splitting one category at a time. We stop the procedure when the gain from one additional cut is no longer significant (∼1%).
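The greedy procedure can be sketched in a few dozen lines. The sketch below sums S²/B over 0.5 GeV mass bins in each category, adds categories in quadrature, and repeatedly applies the single cut with the largest gain. Several details are assumptions rather than the analysis implementation: the event dictionary layout, the short lists of candidate cuts, the toy event model (including a linear growth of the mass resolution with |η|), the reading of the ∼1% threshold as a relative gain in total significance, and the rule that rejects any split leaving a mass bin with signal but no simulated background.

```python
import math
import random

MASS_LO, MASS_HI, BIN_WIDTH = 120.0, 130.0, 0.5
N_BINS = int((MASS_HI - MASS_LO) / BIN_WIDTH)

def z2(events):
    """Sum over mass bins of S_i^2 / B_i; None if a bin has signal but no background."""
    sig, bkg = [0.0] * N_BINS, [0.0] * N_BINS
    for ev in events:
        if MASS_LO <= ev["mass"] < MASS_HI:
            k = int((ev["mass"] - MASS_LO) / BIN_WIDTH)
            (sig if ev["is_signal"] else bkg)[k] += ev["weight"]
    if any(s > 0 and b <= 0 for s, b in zip(sig, bkg)):
        return None
    return sum(s * s / b for s, b in zip(sig, bkg) if b > 0)

def best_split(events, cuts):
    """Try every candidate cut on every variable; return the split with the largest gain."""
    base, best = z2(events), None
    for var, values in cuts.items():
        for cut in values:
            low = [e for e in events if e[var] < cut]
            high = [e for e in events if e[var] >= cut]
            parts = (z2(low), z2(high))
            if None in parts:
                continue  # not enough simulated background to trust this split
            gain = parts[0] + parts[1] - base
            if best is None or gain > best[0]:
                best = (gain, low, high)
    return best

def auto_categorize(events, cuts, min_rel_gain=0.01):
    """Greedily split one category at a time until the gain in significance is small."""
    categories = [events]  # the inclusive category must have background in every populated bin
    while True:
        total = sum(z2(c) for c in categories)
        options = [(idx, best_split(c, cuts)) for idx, c in enumerate(categories)]
        options = [(idx, s) for idx, s in options if s is not None]
        if total <= 0 or not options:
            break
        idx, (gain, low, high) = max(options, key=lambda o: o[1][0])
        if math.sqrt((total + gain) / total) - 1.0 < min_rel_gain:
            break
        categories[idx:idx + 1] = [low, high]
    return categories

# Toy events (assumption): flat background, signal peak widening with |eta|.
rng = random.Random(7)
events = [{"mass": rng.uniform(120.0, 130.0), "bdt": rng.uniform(-1.0, 0.6),
           "eta": rng.uniform(0.0, 2.4), "weight": 1.0, "is_signal": False}
          for _ in range(6000)]
for _ in range(2000):
    eta = rng.uniform(0.0, 2.4)
    events.append({"mass": rng.gauss(125.0, 1.2 + 0.85 * eta), "bdt": rng.uniform(-0.2, 1.0),
                   "eta": eta, "weight": 0.01, "is_signal": True})

cuts = {"bdt": [-0.5, 0.0, 0.5], "eta": [0.9, 1.9]}
categories = auto_categorize(events, cuts)
z_before = math.sqrt(z2(events))
z_after = math.sqrt(sum(z2(c) for c in categories))
```

Splitting a category can never lower the sum of S²/B, so with limited simulated statistics the procedure will happily chase fluctuations in sparsely populated bins. The stopping threshold and the empty-bin rule are the only protection in this sketch; a real analysis would need a more careful treatment.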

Final Categories
At the end of the auto-categorizer procedure, a simplification of the cut boundaries in |η| and BDT scores is performed by rounding some of the cuts. We have checked that no sizable loss of sensitivity is introduced when using the simplified boundaries shown in figure 5. In this way we arrive at 15 event categories. The relative gain in sensitivity is 23% compared to the simple categories based on µµ pT, dijet mass, and muon η used in the CMS searches at 7 and 8 TeV, equivalent to a dataset 50% larger than the one actually collected.
[Figure 6.3: The final categorization.]
[Table 6.1: The optimized event categories, the product of acceptance and selection efficiency (Aε) in % for the different production processes, the total expected number of SM signal events (mH = 125 GeV), the estimated number of background events per GeV at 125 GeV, the FWHM of the signal peak, the background functional fit form, and the S/√B ratio within the FWHM of the signal shape.]
The expected signal and estimated background yields for the final categories, ordered from least to most sensitive, are shown in table 2. The S/√B ratio ranges from 0.12 for category 0 to 0.48 for the most sensitive category 14 with the most signal-like BDT scores.
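The link between a 23% gain in sensitivity and a dataset roughly 50% larger follows from the S/√B figure of merit: both S and B grow linearly with the integrated luminosity, so the significance grows as its square root. A short check, with a hypothetical helper name:

```python
def equivalent_luminosity_factor(relative_gain):
    """Factor by which the dataset would have to grow to give the same gain in S/sqrt(B).

    Assumes the significance scales as the square root of the integrated luminosity.
    """
    return (1.0 + relative_gain) ** 2

# A 23% gain in sensitivity corresponds to 1.23 squared = 1.5129,
# i.e. about 50% more data.
factor = equivalent_luminosity_factor(0.23)
```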

Conclusions and Future Work
The BDT-based categorization helps to enhance the separation power of signal versus background for the difficult search for the Higgs boson decaying to two muons. Using the full data set collected from 2016 to 2018 in Run 2 of the LHC, the experiments will approach the discovery zone for this important decay to second-generation fermions, with an expected significance around 2 standard deviations from the no-signal hypothesis. In order to further increase the sensitivity, deep neural networks with several layers and multi-node architectures are being explored to improve on the already impressive performance of the BDT techniques. Additional promising avenues include the use of extended sets of discriminating variables, especially those targeting the rare VH and ttH production modes. Along with the multivariate discriminators, a more sophisticated automated categorization is being considered, accounting explicitly for the fit function uncertainty in the background estimate in each category. Taken together, the combined BDT plus auto-categorization approach serves as a model for how to achieve the maximum sensitivity to some of the rarest events in nature.