Binary classifier metrics for optimizing HEP event selection

I discuss the choice of evaluation metrics for binary classifiers in High Energy Physics (HEP) event selection and, after reviewing its use in other domains, I point out that the Area Under the ROC Curve (AUC) is of limited relevance in this context. I propose new metrics based on Fisher information, which can be used for both the evaluation and training of HEP event selection algorithms in statistically limited measurements of a parameter.


ROC curves in Medical Diagnostics and Machine Learning
ROC curves are the standard tool in Medical Diagnostics (MD). Using HEP terminology (due to space constraints), they are plots of signal efficiency $\epsilon_s = S_{sel}/(S_{sel}+S_{rej})$ as a function of background efficiency $\epsilon_b = B_{sel}/(B_{sel}+B_{rej})$. The ROC was originally introduced for radar applications of signal detection theory in the 1940s. Its use was then extended to psychophysics, i.e. visual and acoustic perception by human observers, and from there to medical imaging and medical diagnostics. It later became a standard tool also in Machine Learning (ML) research [1].
The ROC curve connects the lower-left corner, where $(\epsilon_b, \epsilon_s) = (0, 0)$, to the top-right corner, where $(\epsilon_b, \epsilon_s) = (1, 1)$. One of its obvious features is that it is independent of disease prevalence, i.e. of the ratio $\pi_s = S_{tot}/(S_{tot}+B_{tot})$ between the total number of signal events $S_{tot} = S_{sel}+S_{rej}$ and that of signal plus background events $N_{tot} = S_{tot}+B_{tot}$ (where $B_{tot} = B_{sel}+B_{rej}$) before selection.
The ROC also possesses another remarkable feature: the region below it is convex, i.e. its slope $\ell = d\epsilon_s/d\epsilon_b$ is monotonically decreasing. This is because $\ell$ is the likelihood ratio between the signal and background hypotheses for the event, and by construction the score $D$ is a monotonically increasing function of $\ell$. Experimental ROCs derived from validation data sets are actually not convex, as they are made up of a sequence of horizontal and vertical steps, but they can be transformed into convex ROCs by considering their "convex hull". The fact that ROCs are convex is very important if the optimal operating point is chosen using the classic cost-benefit analysis described in MD and ML research, where constant benefits $V_{TP}$ and $V_{TN}$ are attributed to each correct decision (TP and TN) and constant costs $K_{FP}$ and $K_{FN}$ to each mistake (FP and FN): the maximum value of the average benefit over all $N_{tot}$ decisions is, in fact, reached at the point where the slope $d\epsilon_s/d\epsilon_b$ equals $\frac{1-\pi_s}{\pi_s}\frac{V_{TN}+K_{FP}}{V_{TP}+K_{FN}}$, which allows an easy geometrical interpretation of the points on the ROC plane in terms of "iso-benefit" lines, once the prevalence $\pi_s$ and the cost-benefit matrix are known [1].
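The convex-hull step mentioned above can be implemented with a standard monotone-chain pass over the empirical ROC points. The following is only a minimal sketch (plain Python; the function name is mine, and points are $(\epsilon_b, \epsilon_s)$ pairs):

```python
def roc_convex_hull(points):
    """Upper convex hull of an empirical (stepwise) ROC.

    `points` are (eps_b, eps_s) pairs; the hull restores the monotonically
    decreasing slope expected of an ideal ROC.
    """
    pts = sorted(set(points))
    hull = []
    for p in pts:
        # Pop the last hull point while it lies on or below the segment
        # from the second-to-last hull point to p (non-concave turn).
        while len(hull) >= 2:
            (ox, oy), (ax, ay) = hull[-2], hull[-1]
            cross = (ax - ox) * (p[1] - oy) - (ay - oy) * (p[0] - ox)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```

Applied to a stepwise ROC made of horizontal and vertical segments, this returns the piecewise-linear concave majorant on which the slope, and hence the likelihood ratio, is well defined.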
The main reason for the popularity of ROCs in MD (and ML), however, is not their convex shape, but rather the fact that ROCs describe diagnostic accuracy independently of disease prevalence $\pi_s$ and of the choice of an operating point. This is important because physicians may choose a threshold $D_{thr}$ using different cost-benefit criteria, and because prevalence may be unknown at the time of diagnosis. The Area Under the ROC Curve (AUC), in particular, emerged over time in both MD and ML as a better measure of accuracy than another popular metric, the overall fraction of correct decisions $ACC = (TP+TN)/N_{tot} = (S_{sel}+B_{rej})/N_{tot}$, for precisely these reasons: while ACC strongly depends on $\pi_s$ and on the choice of an operating point, the AUC is independent of $\pi_s$ and describes all possible operating points on the ROC. Another advantage of the AUC in MD is that it has a well defined and relevant interpretation, namely as "the probability that a randomly chosen diseased subject is correctly ranked with greater suspicion than a randomly chosen non-diseased subject" [2].
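The ranking interpretation quoted above can be checked directly: the AUC equals the fraction of signal-background pairs in which the signal event gets the higher score (ties counting one half). A minimal sketch (plain Python; the function name is mine):

```python
def auc_rank(signal_scores, background_scores):
    """AUC as the probability that a randomly chosen signal event is
    ranked above a randomly chosen background event; ties count 1/2."""
    pairs = len(signal_scores) * len(background_scores)
    wins = sum(1.0 if ds > db else 0.5 if ds == db else 0.0
               for ds in signal_scores
               for db in background_scores)
    return wins / pairs
```

This pairwise count is the Mann-Whitney statistic, and for a finite sample it coincides with the trapezoidal area under the empirical ROC.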
In spite of their popularity, however, ROCs and AUCs have known limitations. The main problem with the AUC, both in medical diagnostics and in concrete applications of ML techniques in banking and commerce, is that it is misleading when comparing two ROCs that cross: the AUC is built as an integral that gives equal weight to all parts of the ROC curves, but in the end a binary classifier is generally used by choosing a specific operating point, and in this case what counts is which ROC provides the better performance in the region where the operating point is chosen [3]. Another issue that was recently pointed out in ML research is the fact that ROC analysis may not be appropriate and may lead to overly optimistic evaluations in problems involving highly imbalanced data sets, i.e. where prevalence is very low or very high, and Precision-Recall (PRC) curves may give a more informative picture in this case [4].

PRC curves in Information Retrieval
PRC curves are the standard tool in Information Retrieval (IR). They are plots of signal purity, or "precision", $\rho = S_{sel}/(S_{sel}+B_{sel})$, as a function of signal efficiency $\epsilon_s$, or "recall", the two metrics that have essentially been used since the beginning of research in quantitative IR evaluation. ROC analysis was also suggested and thoroughly investigated as an alternative approach to measure IR effectiveness, but never found general acceptance in the field [5].
A popular scalar metric in IR for unranked evaluation, i.e. one which can be computed from the confusion matrix alone at any given point on the PRC, is $F_1 = 2\epsilon_s\rho/(\epsilon_s+\rho)$. Traditional scalar metrics for ranked evaluation, which integrate the knowledge of the whole PRC, include Mean Average Precision (which is related to the area under the PRC, or AUCPR), "precision at $\kappa$" and precision at a fixed recall $\epsilon_s$. These last two metrics are popular because, while in MD choosing an operating point is a compromise between the benefit of correct diagnoses of both ill and healthy patients (TPs and TNs) and the cost of mistakes (FPs and FNs), in IR users typically decide upfront that they can only afford the effort to retrieve $\kappa = N_{sel}$ documents, or that they need to retrieve at least a fraction $\epsilon_s$ of all existing relevant documents.
The Discounted Cumulative Gain, $DCG_\kappa = \sum_{i=1}^{\kappa} G[i]/\max(1, \log_2 i)$, is a more recent but already popular metric describing users' gain after retrieving $\kappa$ documents, taking into account for each document its non-binary true relevance grade $G[i]$ and the order $i$ in which it was retrieved [6].

HEP event selection evaluation: a comparison to MD and IR
While "standard" metrics coming from other domains, like the AUC, are frequently used also for evaluating binary classifiers in HEP event selection, the main idea of this paper is that different metrics must be chosen for different problems, according to their specific goals.
Event selection is used in HEP in various contexts. During data taking, trigger decisions maximize the volume of useful data that can be stored with limited computing resources [7]: this implies maximising signal purity $\rho = S_{sel}/N_{sel}$ while the total numbers of input and output events per unit time, $N_{tot}$ and $N_{sel}$, are fixed (this is analogous to "precision at $\kappa$" in IR). During data analysis, event selections are optimized to minimize errors in parameter measurements and to maximize the exclusion and discovery potentials of hypothesis tests for new physics models. While parameter measurements and searches require different metrics, I give here some general comments that apply to both, comparing HEP specificities to those of IR and MD in various respects. In Sec. 3, I will then focus on statistically limited parameter measurements.
• Qualitative imbalance. In MD, healthy and ill patients are both relevant: a diagnostic test is evaluated on its ability to detect a disease in ill patients (TPs) and to rule it out in healthy patients (TNs) [8]. IR is based instead on the idea that a whole class of documents is truly "irrelevant". Irrelevant documents in IR, like background events in HEP, are just a nuisance: what counts is the number of those incorrectly selected (FPs), while the number of those rejected (TNs) is irrelevant. In my opinion, this is the main reason why classifiers are mainly evaluated using $\epsilon_s$ and $\epsilon_b$ in MD and using $\epsilon_s$ and $\rho$ in IR and HEP. Note also that the ACC metric, popular in MD, is invariant under an inversion of positive and negative labels, while $\epsilon_s$, $\rho$ and $F_1$, popular in IR, are invariant under a change of the TN count [9].
• Quantitative imbalance. In MD, relatively balanced class distributions are the norm, even if high skews do exist, e.g. for rare diseases. Data sets with 1 malignant mammography in 150 have been considered as "highly-skewed" (and PRCs have been suggested as more appropriate than ROCs to analyze them) [4]. In IR, instead, it is normal for data sets to be extremely skewed, with less than 1 relevant document in 1000. HEP problems are even more imbalanced, e.g. only 1 in $\sim 10^9$ LHC events (before any preselection) is a Higgs.
• Cost/benefit models. Traditional ROC analysis in MD and ML assigns a value to each decision and maximises its average, $(TP \cdot V_{TP} - FN \cdot K_{FN} + TN \cdot V_{TN} - FP \cdot K_{FP})/(TP+FN+TN+FP)$. This model is not used in IR and HEP classifiers, whose performance does not depend on TN.
• Non-binary categorization. A strictly binary categorization as signal or background is used in most IR, MD and ML applications. Only a few metrics use a non-dichotomous evaluation, such as the Obuchowski measure in MD, example-dependent cost-sensitive classification in ML and graded relevance assessment (e.g. DCG) in IR [6,10]. In HEP fits of a parameter $\theta$, each signal event $i$ may have a different sensitivity $\gamma_i$ to it; this is discussed in Sec. 3.
• Ranking and partitioning. A very specific feature of HEP, compared to MD and IR, is the routine use of distribution fits. As discussed in Sec. 3, these can be evaluated using metrics built as sums over different event partitions. Metrics based on the classifier's ranking and computed as sums over event partitions also exist in MD (AUC) and IR (AUCPR and DCG), but these are not appropriate to evaluate the performance of HEP distribution fits.
• Scale invariance. Another specific feature of HEP is the need to use different analysis tools and evaluation metrics depending on the scale $S_{tot}$ of the problem. Searches for new physics operate in a regime where $S_{tot}$ is very small and are often optimized by maximizing a metric like $P = \epsilon_s/(\frac{3}{2}+\sqrt{B_{sel}})$ [11]. With enough events $S_{tot}$, measurements of model parameters of signal processes become possible; the statistical errors scale as $1/\sqrt{S_{tot}}$ and can be minimized in terms of $\epsilon_s$ and $\rho$ alone, as discussed in Sec. 3. As $S_{tot}$ grows further, systematic errors become important and can no longer be ignored in classifier evaluation.
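To make the imbalance argument concrete, the sketch below evaluates several of the metrics mentioned above on a hypothetical, highly imbalanced confusion matrix (the numbers are invented for illustration): ACC stays close to 1 even for a selection with very low purity, while $\rho$, $F_1$ and the Punzi-like metric $P$ react to it.

```python
import math

def selection_metrics(tp, fn, fp, tn):
    """Confusion-matrix metrics discussed in the text, in HEP notation:
    TP = S_sel, FN = S_rej, FP = B_sel, TN = B_rej."""
    eps_s = tp / (tp + fn)                      # signal efficiency (recall)
    rho = tp / (tp + fp) if tp + fp else 0.0    # purity (precision)
    acc = (tp + tn) / (tp + fn + fp + tn)       # fraction of correct decisions
    f1 = 2 * eps_s * rho / (eps_s + rho) if eps_s + rho else 0.0
    punzi = eps_s / (1.5 + math.sqrt(fp))       # P = eps_s / (3/2 + sqrt(B_sel))
    return eps_s, rho, acc, f1, punzi
```

With $S_{tot} = 100$ signal events in $N_{tot} = 100100$, a selection with $\epsilon_s = 0.8$ keeping 1000 background events gives $ACC \approx 0.99$ despite $\rho \approx 0.07$; a classifier that rejects everything scores an even higher ACC while being useless, which is precisely the TN-dependence problem noted above.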
In summary, HEP event selection is more similar to IR than to MD, and it is not surprising that $\rho$ and $\epsilon_s$ are commonly used for its evaluation. The irrelevance of the TN count for both searches and parameter measurements is by itself a signal that any metric that depends on TN, like the AUC [9], is of limited relevance. As in MD, in particular, the main limitation of the AUC is that it is misleading when comparing two classifiers whose ROC curves cross. HEP event selection, however, also has two specific features that clearly set it apart from other domains, namely different event-by-event sensitivities and the routine use of distribution fits: this implies the need to develop HEP-specific evaluation metrics, as discussed in Sec. 3.

Statistical errors in parameter estimation and Fisher information
In this paper, I focus on HEP measurements of a single parameter $\theta$, where systematic errors are negligible with respect to the statistical error $\Delta\theta$. I propose that the evaluation and training of event selection classifiers in this case should be based on the minimization of $\Delta\theta$, or equivalently on the maximization of the Fisher information $I_\theta = 1/(\Delta\theta)^2$ about $\theta$ [12]. I assume that the signal event distribution depends on $\theta$, while the background distribution does not.
A simple example is the measurement of a total cross section $\sigma_s$ in a counting experiment, where $\sigma_s^{(meas)} = (M - B_{sel})/(L\epsilon_s)$ is derived from the observed count of selected events $M = N_{sel}^{(meas)}$ and the measured luminosity $L$, while the expected number of selected background events $B_{sel}$ and the signal efficiency $\epsilon_s$ are derived from MC simulations. Assuming $\epsilon_s$ is independent of $\sigma_s$, the expected statistical error $\Delta\sigma_s$ is $\Delta N_{sel}/(L\epsilon_s)$, where $\Delta M = \Delta N_{sel} = \sqrt{N_{sel}}$ is the square root of the expected count of selected events $N_{sel} = S_{sel}+B_{sel} = S_{sel}/\rho = L\epsilon_s\sigma_s/\rho$. This leads to
$$\frac{1}{(\Delta\sigma_s)^2} = \frac{\epsilon_s \rho\, S_{tot}}{\sigma_s^2}. \qquad (1)$$
To minimize $\Delta\sigma_s$, one should then maximise the product $\epsilon_s\rho$. This is a well known strategy to optimize event selections in total cross section measurements by counting experiments [13]. Note that the knowledge of $\sigma_s$ from previous measurements is needed to compute $\rho$. If $\sigma_s$ depends on another parameter $\theta$, the counting experiment may be used to determine $\theta$ with a statistical error $\Delta\theta$ given by $\Delta\theta = \Delta\sigma_s/\frac{\partial\sigma_s}{\partial\theta}$. This leads to $\frac{1}{(\Delta\theta)^2} = \left(\frac{1}{S_{tot}}\frac{\partial S_{tot}}{\partial\theta}\right)^2 \epsilon_s\rho\, S_{tot}$, where $(1/S_{tot})(\partial S_{tot}/\partial\theta) = (1/\sigma_s)(\partial\sigma_s/\partial\theta)$ indicates the sensitivity of the total cross section $\sigma_s$ to $\theta$. This method was used for instance to determine the W mass at LEP2 by measuring the WW cross section at threshold [13].
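Eq. 1 is easy to check numerically. A minimal sketch (plain Python; the function name and working points are mine) computing the expected counting-experiment error and comparing two hypothetical selections:

```python
import math

def delta_sigma(eps_s, rho, s_tot, sigma_s):
    """Expected statistical error on sigma_s in a counting experiment,
    from Eq. 1: 1/(Delta sigma_s)^2 = eps_s * rho * S_tot / sigma_s^2."""
    return sigma_s / math.sqrt(eps_s * rho * s_tot)
```

Between a loose working point with $(\epsilon_s, \rho) = (0.9, 0.3)$ and a tighter one with $(0.6, 0.7)$, the tighter one gives the smaller error because its product $\epsilon_s\rho$ is larger, as the text argues.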
The measurement of a total cross section $\sigma_s$ or of a generic parameter $\theta$, however, is most often performed in HEP using a distribution fit, rather than a counting experiment. I consider here the common case of a maximum likelihood fit on a binned histogram where the data are Poisson distributed. Binned fits rely on splitting all events into $K$ disjoint partitions, or "bins", according to the value(s) of one (or more) variable(s) describing the properties of each event. A binned fit can be considered as the combination of $K$ different measurements performed on these disjoint data sets, where $\theta$ is determined in each bin from the observed count of selected events $m_k = n_{sel,k}^{(meas)}$, with an error $(\Delta\theta)_k$ derived from $\frac{\partial n_{sel,k}}{\partial\theta}(\Delta\theta)_k = \Delta n_{sel,k} = \sqrt{n_{sel,k}}$, the square root of the expected number of selected events $n_{sel,k}$. For statistically limited fits, these $K$ measurements are independent and a fortiori uncorrelated: the information about $\theta$ from the fit is thus equal to the sum of the individual contributions of the $K$ bins,
$$I_\theta = \frac{1}{(\Delta\theta)^2} = \sum_{k=1}^{K} \frac{1}{(\Delta\theta)_k^2} = \sum_{k=1}^{K} \tilde\epsilon_k \tilde\rho_k \tilde\varphi_k^2\, s_{tot,k}, \qquad (2)$$
where $s_{tot,k}$ is the expected signal event count before selection, $\tilde\epsilon_k = s_{sel,k}/s_{tot,k}$ the local signal efficiency (assumed independent of $\theta$) and $\tilde\rho_k = s_{sel,k}/(s_{sel,k}+b_{sel,k})$ the local signal purity in bin $k$, and $\tilde\varphi_k = (1/s_{tot,k})(\partial s_{tot,k}/\partial\theta)$ represents the local sensitivity of the signal distribution in bin $k$ to the value of $\theta$. The expression for $I_\theta$ in terms of $n_{sel,k}$ and $\partial n_{sel,k}/\partial\theta$ in Eq. 2 can also be more formally derived as the variance of the Fisher score $\partial\log P(m,\theta)/\partial\theta$ from the joint probability $P(m,\theta) = \prod_{k=1}^{K} e^{-n_{sel,k}}\, n_{sel,k}^{m_k}/m_k!$ for the observation of $m = \{m_1, \ldots, m_K\}$. The optimal classifier is thus one whose local efficiencies $\tilde\epsilon_k$ and purities $\tilde\rho_k$ maximize $I_\theta$ in Eq. 2. Note that the knowledge of $\theta$ from previous measurements is needed to compute $\tilde\rho_k$. A dimensionless scalar metric between 0 and 1, however, is a more practical tool to evaluate classifiers.
In particular, I suggest to use the ratio of $I_\theta$ to the information $I_\theta^{(ideal)}$ which could be attained with an "ideal" classifier achieving $\tilde\epsilon_k = 1$ and $\tilde\rho_k = 1$ in every bin $k$:
$$FIP = \frac{I_\theta}{I_\theta^{(ideal)}} = \frac{\sum_k \tilde\epsilon_k \tilde\rho_k \tilde\varphi_k^2\, s_{tot,k}}{\sum_k \tilde\varphi_k^2\, s_{tot,k}}. \qquad (3)$$
I named this metric FIP, for Fisher Information Part, because it represents the fraction of the theoretically available information that a classifier is able to retain. Equation 3 is interesting for several reasons: first, the classifier is evaluated in each bin in terms of signal purity and efficiency, while the number of True Negatives (rejected background events) and the background efficiency are irrelevant; second, the knowledge of local purity and efficiency in each bin is needed to evaluate the classifier, while the global purity and efficiency for the overall selection are not enough; third, different signal events contribute in different ways to the evaluation of the classifier, depending on $\tilde\varphi_k$, the local sensitivity to $\theta$ of the bin $k$ they belong to. The expression in Eq. 3 allows classifier evaluation for a wide range of statistically limited binned fits of a parameter, but it can be simplified in many special cases. To start with, measuring $\sigma_s$ or $\theta$ in a counting experiment is equivalent to a fit with a single bin. As $s_{tot,k}$ and $\tilde\varphi_k^2$ for $k = 1$ cancel out in the ratio, the FIP metric in this case reduces to $FIP_1 = \epsilon_s\rho$, which is simply the product of global purity and efficiency, previously discussed for Eq. 1.
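Eqs. 2 and 3 translate directly into a few lines of code. A minimal sketch (plain Python; each bin is a hypothetical tuple $(s_{tot,k}, \tilde\epsilon_k, \tilde\rho_k, \tilde\varphi_k)$ and the function names are mine):

```python
def fisher_information(bins):
    """Eq. 2: I_theta = sum_k eps_k * rho_k * phi_k**2 * s_tot_k."""
    return sum(eps * rho * phi ** 2 * s for s, eps, rho, phi in bins)

def fip(bins):
    """Eq. 3: ratio of I_theta to the ideal case with eps_k = rho_k = 1."""
    ideal = sum(phi ** 2 * s for s, eps, rho, phi in bins)
    return fisher_information(bins) / ideal
```

For a single bin the sensitivity and signal count cancel in the ratio and `fip` reduces to $\epsilon_s\rho$, i.e. $FIP_1$, as stated in the text.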

FIP metric for a total cross section fit from the score distribution: FIP2
Another very special class of measurements includes binned fits of a parameter $\theta$ where the classifier score $D$ is used as a partitioning variable. I consider the common case where all events are used and $\tilde\epsilon_k = 1$ in every bin $k$, because the classifier is used to partition events, rather than to select or reject them. It is thus no longer necessary to distinguish between total and selected event counts in each bin, and I will use simpler symbols with no suffixes for signal ($s_k = s_{sel,k} = s_{tot,k}$), background ($b_k = b_{sel,k} = b_{tot,k}$) and their sum ($n_k = n_{sel,k} = n_{tot,k}$).
An important subcase, which is also an extremely common practice in HEP [14], is the fit of a total signal cross section $\sigma_s$ from the one-dimensional (1-D) distribution of the score $D$. This has two important consequences. First, as long as a change in $\sigma_s$ can be considered as a global rescaling of the $D$ distribution for signal events, the sensitivity of this distribution to $\theta = \sigma_s$ is a constant across bins, $\tilde\varphi_k = \frac{1}{s_k}\frac{\partial s_k}{\partial\sigma_s} = 1/\sigma_s$, and the FIP metric in Eq. 3 reduces to
$$FIP_2 = \frac{1}{S_{tot}}\sum_{k=1}^{K} \tilde\rho_k\, s_k. \qquad (4)$$
Second, fitting the 1-D distribution of $D$ implies that every bin $k$ in the fit effectively corresponds to a separate arc of the ROC curve, subtending two ranges $d\epsilon_s$ and $d\epsilon_b$ of signal and background efficiency, such that $s_k = S_{tot}\, d\epsilon_s$ and $b_k = B_{tot}\, d\epsilon_b$. This implies that the local purity can be computed as $\tilde\rho(\epsilon_s) = S_{tot}\, d\epsilon_s/(S_{tot}\, d\epsilon_s + B_{tot}\, d\epsilon_b) = 1/[1 + (d\epsilon_b/d\epsilon_s)(1-\pi_s)/\pi_s]$. $FIP_2$ can then be computed from the ROC alone, as long as the prevalence $\pi_s$ is also known:
$$FIP_2 = \int_0^1 \frac{d\epsilon_s}{1 + \frac{d\epsilon_b}{d\epsilon_s}\frac{1-\pi_s}{\pi_s}}. \qquad (5)$$
Some numerical tests [15] that I performed using a Python implementation of Eq. 5 confirmed that $FIP_2$ correctly predicts the average statistical error $\Delta\sigma_s$ of a binned fit of $\sigma_s$ from the distribution of the score $D$. In my implementation, I computed the integral in Eq. 5 not on the native ROC obtained from a validation sample, but rather on its convex hull: a stepwise ROC, in fact, leads to an overestimate of $FIP_2$, because signal events contribute with an overestimated purity $\tilde\rho = 1$ in the vertical steps, where $d\epsilon_s > 0$ and $d\epsilon_b = 0$, while the negative impact of background events is underestimated because they are confined to the horizontal steps, where $d\epsilon_s = 0$ and $d\epsilon_b > 0$. It is in any case interesting to note that, given two convex ROCs, one of the two is guaranteed to have a higher $FIP_2$ value than the other if it has a higher $\epsilon_s$ for every value of $\epsilon_b$; this can be easily proved using numerical and geometrical arguments.
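The Eq. 5 integral is straightforward over a piecewise-linear convex-hull ROC, on which the local purity is constant along each arc. A minimal sketch (plain Python; the function name is mine, and the hull is a list of $(\epsilon_b, \epsilon_s)$ vertices):

```python
def fip2_from_roc(hull, pi_s):
    """Eq. 5: FIP2 as the integral of local purity over a convex ROC.
    Each linear arc contributes d_eps_s / (1 + (d_eps_b/d_eps_s)(1-pi_s)/pi_s),
    rewritten below to avoid dividing by d_eps_s."""
    total = 0.0
    for (b0, s0), (b1, s1) in zip(hull, hull[1:]):
        d_eps_s, d_eps_b = s1 - s0, b1 - b0
        if d_eps_s > 0:  # horizontal arcs carry no signal and contribute 0
            total += d_eps_s ** 2 * pi_s / (d_eps_s * pi_s + d_eps_b * (1 - pi_s))
    return total
```

Two sanity checks: a perfect classifier, whose hull passes through $(\epsilon_b, \epsilon_s) = (0, 1)$, gives $FIP_2 = 1$ for any prevalence, while a random classifier (the diagonal) gives $FIP_2 = \pi_s$, the purity of an unpartitioned sample.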
Rather than from the ROC and prevalence, $FIP_2$ can also be derived from the PRC alone, as $FIP_2 = \int_0^1 \frac{\rho\, d\epsilon_s}{1 - (\epsilon_s/\rho)(d\rho/d\epsilon_s)}$, but this is less practical than Eq. 5, because both calculations only make sense on the ROC convex hull, which is easier to compute from the ROC than from the PRC. Note also that these expressions for $FIP_2$ are integrals over the ROC and PRC, just like the AUC and AUCPR. Unlike the AUC, however, $FIP_2$ is qualitatively and quantitatively relevant for minimizing $\Delta\sigma_s$ in a fit for $\sigma_s$, while the AUC is irrelevant and misleading, amongst other reasons because it is independent of the prevalence $\pi_s$; it is easy to prepare an example [15] showing that, if the maximum AUC is used as the criterion to choose one of two classifiers, the worse classifier, leading to the larger error $\Delta\sigma_s$, is selected for some specific values of $\pi_s$. Note also that $FIP_2 = \int \tilde\rho\, d\epsilon_s$ is an integral of local purity, while $AUCPR = \int \rho\, d\epsilon_s$ uses global purity.

Optimal partitioning and Fisher information
Maximizing the information in Eq. 2 for a binned fit of $\theta$ involves two distinct but related steps: choosing the optimal partitioning of events into $K$ bins, i.e. choosing the distributions to fit and the bin size, and choosing the classifier providing the optimal efficiencies $\tilde\epsilon_k$ and purities $\tilde\rho_k$ in those bins. As discussed in Sec. 3, I consider the case where the score $D$ is (one of) the partitioning variable(s), so that all events are used and $\tilde\epsilon_k = 1$ in each bin $k$.
The relevant question to ask for optimizing the partitioning is under which circumstances there is a benefit in splitting the events in one bin $k = 0$ into two separate bins $k = 1, 2$, so that $n_0 = n_1+n_2$ and $s_0 = s_1+s_2$. From Eq. 2, the information "inflow" [12] in this split is given by
$$\Delta I_\theta = \frac{n_1 n_2}{n_1+n_2}\left(\tilde\rho_1\tilde\varphi_1 - \tilde\rho_2\tilde\varphi_2\right)^2. \qquad (6)$$
Equation 6 clearly shows that the optimal partitioning strategy consists in separating events into bins with different values of $\tilde\rho_k\tilde\varphi_k = \tilde\rho_k\frac{1}{s_k}\frac{\partial s_k}{\partial\theta}$. In other words, if $\tilde\rho$ and $\tilde\varphi$ were observable properties of each event, an optimal strategy for measuring $\theta$ would consist in the fit of the 1-D distribution of $\tilde\rho\tilde\varphi$, or alternatively the fit of the 2-D distribution of $\tilde\rho$ versus $\tilde\varphi$. I note, in passing, that these are optimal variables for both binned and unbinned fits.
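The inflow of Eq. 6 can be cross-checked against the direct difference of per-bin informations: with $\tilde\epsilon_k = 1$, each bin contributes $(s_k\tilde\varphi_k)^2/n_k$ to Eq. 2, and merging two bins gives $s_0\tilde\varphi_0 = s_1\tilde\varphi_1 + s_2\tilde\varphi_2$ (the sensitivity is the signal-weighted average). A minimal sketch with invented numbers:

```python
def info_bin(s, n, phi):
    """Per-bin contribution to Eq. 2 when eps = 1: n*rho^2*phi^2 = (s*phi)^2/n."""
    return (s * phi) ** 2 / n

def info_inflow(s1, n1, phi1, s2, n2, phi2):
    """Eq. 6: Delta I = n1*n2/(n1+n2) * (rho1*phi1 - rho2*phi2)^2."""
    return n1 * n2 / (n1 + n2) * (s1 / n1 * phi1 - s2 / n2 * phi2) ** 2
```

The direct difference `info_bin(s1, n1, phi1) + info_bin(s2, n2, phi2) - (s1*phi1 + s2*phi2)**2 / (n1 + n2)` reproduces Eq. 6 exactly, and the inflow vanishes, as expected, when the two sub-bins have equal $\tilde\rho\tilde\varphi$.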
Neither $\tilde\rho$ nor $\tilde\varphi$, however, is an observable property of an event. The question, then, is which partitioning variables should be used to ensure that different bins of their distributions contain events with different values of $\tilde\rho\tilde\varphi$. When formulated in these terms, this is a clear example of a ML problem that can be solved by a regression algorithm: given the multi-dimensional space of event observables $x$, the problem consists in finding the two functions of these observables whose distributions $R_{\tilde\rho}(x)$ and $R_{\tilde\varphi}(x)$ "best" approximate $\tilde\rho(x)$ and $\tilde\varphi(x)$, or in other words maximally separate phase space regions with different relative abundances of signal and background, and signal events with different sensitivities to the parameter $\theta$.
If a scoring classifier is used for event selection, the score $D$ is often used instead of a regression predictor $R_{\tilde\rho}$ for $\tilde\rho$, even if $D$ is actually the output of a classification algorithm (the distinction is blurred). As discussed, total cross sections are often measured by fitting the $D$ distribution; this is an optimal strategy, because $D$ is "best" trained to separate regions with different values of $\tilde\rho$, while a predictor $R_{\tilde\varphi}$ for $\tilde\varphi$ is not needed, as $\tilde\varphi(x) = 1/\sigma_s$ is a constant.
For fits of a generic parameter $\theta$ like a particle mass or coupling, one or more kinematic observables (such as invariant masses or transverse momenta) are generally used to separate events with different values of $\tilde\varphi$, rather than building a regression predictor $R_{\tilde\varphi}$ for $\tilde\varphi$. In the "optimal observable" method [16], specific functions of the observed event kinematics, which would provide optimal separation if computed from the true event kinematics, are used, but the separation power of these observables is degraded by the finite detector resolution. Another approach that I would suggest to try out, but have not yet worked on concretely, consists in storing the event-by-event MC weight derivative $\gamma_i = \partial w_i/\partial\theta$ for a sample of unweighted ($w_i = 1$) simulated signal events, and training a predictor $R_{\tilde\varphi}(x)$ against $\gamma_i$, so that it would "best" predict $\tilde\varphi(x)$. It is easy to see, in fact, that the bin-by-bin sensitivity $\tilde\varphi_k = \frac{1}{s_k}\frac{\partial s_k}{\partial\theta}$ is simply the average $\bar\gamma_k$ of the event-by-event sensitivities $\gamma_i$ in bin $k$. A 2018 paper by Brehmer et al. [17] has suggested a similar approach using likelihood gradients. Note that background contamination effectively decreases the bin-by-bin sensitivity to $\tilde\rho_k\tilde\varphi_k$, because it dilutes the $s_k$ signal events of average sensitivity $\tilde\varphi_k$ with $b_k$ background events whose sensitivity is 0.
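The observation that $\tilde\varphi_k$ is the average of the $\gamma_i$ can be illustrated with a simple binned-average stand-in for the proposed predictor $R_{\tilde\varphi}$ (plain Python; the function name is mine, events are hypothetical $(x, \gamma)$ pairs, and the actual proposal in the text is a trained regressor, not a histogram):

```python
def phi_profile(events, edges):
    """Binned average of the event-by-event sensitivities gamma_i:
    an estimate of phi_k in each bin of the observable x."""
    n_bins = len(edges) - 1
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, gamma in events:
        for k in range(n_bins):
            if edges[k] <= x < edges[k + 1]:
                sums[k] += gamma
                counts[k] += 1
                break
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

A regression algorithm trained against $\gamma_i$ would play the same role in the full multi-dimensional observable space, where explicit binning is impractical.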

Fisher information for training classification and regression trees
A partitioning strategy using a classifier $D$ to predict $\tilde\rho$ and a predictor $R_{\tilde\varphi}$ for $\tilde\varphi$, however, is only truly optimal to minimize the measurement error on $\theta$ if these algorithms are "best" trained according to a metric that is relevant to $\Delta\theta$. I focus on Decision Tree (DT) algorithms [18], which are commonly used in HEP [14] and are the natural companions of binned fits, as both describe distributions in terms of disjoint event partitions, the "bins" of a fit and the "nodes" of a tree. DT algorithms proceed by iteratively splitting nodes into two sub-nodes. Given a node with $n_0$ events, amongst all possible splits into two nodes with $n_1$ and $n_2$ events, the one with the highest loss $-\Delta H = n_0 h_0 - n_1 h_1 - n_2 h_2$ in the impurity $H = \sum_k n_k h_k$ is chosen, and the process stops when the loss $-\Delta H$ is below a threshold, and/or other conditions are met.
The impurity $h_k$ of a node $k$ would generally be computed as a function of the local purity $\tilde\rho_k$ using one of two criteria in classification trees, the Gini diversity $H_{Gini} = \sum_k 2 n_k \tilde\rho_k(1-\tilde\rho_k)$ or the Shannon information entropy $H_{entropy} = -\sum_k n_k[\tilde\rho_k\log_2\tilde\rho_k + (1-\tilde\rho_k)\log_2(1-\tilde\rho_k)]$, and using the mean squared error $H_{MSE} = \sum_k n_k[\frac{1}{n_k}\sum_{i\in k}(\gamma_i - \bar\gamma_k)^2]$ in a regression tree for $\tilde\varphi = \bar\gamma$. My proposal is to use Fisher information in both cases, $H_{Fisher} = -I_\theta = -\sum_k n_k\tilde\rho_k^2\tilde\varphi_k^2$. Maximising the impurity loss $-\Delta H$ would thus be equivalent to maximising the Fisher information inflow $\Delta I_\theta$ in Eq. 6.
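A single Fisher-impurity split in the DT style described above can be sketched as follows (plain Python; the threshold scan and function names are mine, events are hypothetical $(x, \text{is\_signal})$ pairs, and $\tilde\varphi$ is taken constant as in a $\sigma_s$ fit):

```python
def fisher_impurity(s, n, phi=1.0):
    """H_Fisher for one node: -n * rho^2 * phi^2, with rho = s/n."""
    return -n * (s / n) ** 2 * phi ** 2

def best_fisher_split(events, phi=1.0):
    """Scan all threshold splits of sorted (x, is_signal) pairs; return the
    last x on the left side and the impurity loss -Delta H of the best split."""
    events = sorted(events)
    n_tot, s_tot = len(events), sum(y for _, y in events)
    h0 = fisher_impurity(s_tot, n_tot, phi)
    best_x, best_gain = None, 0.0
    s_left = 0
    for i in range(1, n_tot):
        s_left += events[i - 1][1]
        gain = (h0 - fisher_impurity(s_left, i, phi)
                   - fisher_impurity(s_tot - s_left, n_tot - i, phi))
        if gain > best_gain:
            best_x, best_gain = events[i - 1][0], gain
    return best_x, best_gain
```

With two background events at low $x$ and two signal events at high $x$, the best split separates the two populations, and its gain equals the information inflow of Eq. 6 for that split.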
In practice, an approach to optimize a measurement of $\theta$ could be the following: first, train a regression tree $R_{\tilde\varphi}$ for $\tilde\varphi$ on signal MC events, against $\gamma_i$; then, train a classification tree $D$ (i.e. a regression tree for $\tilde\rho$) on both signal and background MC, using the predictor $R_{\tilde\varphi}$ in the loss function; finally, perform a 2-D binned fit on $D$ and on the predictor $R_{\tilde\varphi}$.
With respect to the Gini and entropy criteria for classification, $H_{Fisher}$ has two main differences: first, it is asymmetric in $\tilde\rho$, e.g. pure signal nodes ($\tilde\rho_k = 1$) are very valuable, while pure background nodes ($\tilde\rho_k = 0$) are worthless; second, it weighs nodes according to their sensitivity $\tilde\varphi$. Quite surprisingly, however, for problems like $\sigma_s$ measurements where $\tilde\varphi$ is a constant, it is easy to show that the Gini and Fisher splitting strategies are equivalent, because
$$-\frac{\Delta H_{Gini}}{2} = -s_1\left(1-\frac{s_1}{n_1}\right) - s_2\left(1-\frac{s_2}{n_2}\right) + (s_1+s_2)\left(1-\frac{s_1+s_2}{n_1+n_2}\right) = \frac{(s_1 n_2 - s_2 n_1)^2}{n_1 n_2 (n_1+n_2)} = -\Delta H_{Fisher}. \qquad (7)$$
One advantage of the approach I suggest is that it makes it extremely easy to interpret and visualize the loss functions used for classifier training in terms of the metrics used for its evaluation, because both aim at maximising the information $I_\theta$. Taking the example of a classification tree, which is essentially a regression tree $D$ for $\tilde\rho_k$ in my approach, this is best understood by plotting the different ROCs and the $FIP_2$ distribution for the same algorithm, using many different training and validation samples drawn randomly from the same toy model distribution [15]. While in the "ideal" case the ROC passes through $(\epsilon_b, \epsilon_s) = (0, 1)$ and $FIP_2$ equals 1, possessing the knowledge of the real multi-dimensional distribution would only allow reaching a "limit" ROC and a limit value of $FIP_2$. When testing a classifier against a toy model, the distribution of $FIP_2$ over the training sample would generally peak above the limit, while on the validation sample it would generally peak below it, with the distance between the two peaks representing a measure of overtraining in the algorithm.
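The identity in Eq. 7 is easy to verify numerically for arbitrary node contents. A minimal sketch (plain Python; function names are mine):

```python
def gini_loss_half(s1, n1, s2, n2):
    """-Delta H_Gini / 2 for splitting a node into two sub-nodes,
    with the per-node Gini term n*rho*(1-rho) = s*(1 - s/n)."""
    def g(s, n):
        return s * (1 - s / n)
    return g(s1 + s2, n1 + n2) - g(s1, n1) - g(s2, n2)

def fisher_loss(s1, n1, s2, n2):
    """-Delta H_Fisher for constant phi (the closed form in Eq. 7)."""
    return (s1 * n2 - s2 * n1) ** 2 / (n1 * n2 * (n1 + n2))
```

The two losses agree for any signal/total contents of the two sub-nodes, which is why, for constant $\tilde\varphi$, a Gini-trained tree already picks the Fisher-optimal splits.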

Conclusions, outlook and acknowledgements
The main message of this paper is that binary classifiers must be evaluated using different metrics, specific to the goals of the problem to which they are applied, and that no single scalar metric exists which is appropriate to any classification problem in any domain. In particular, I pointed out that the AUC is of limited relevance in HEP event selection and proposed new metrics based on Fisher information, which can be used for both the evaluation and training of event selection algorithms in statistically limited measurements of one parameter.
Ciulli for many useful discussions. I am grateful to my former colleagues in ALEPH, where many of the ideas I reported in this article originated. I thank S. Gleyzer and the anonymous referee for useful feedback on my initial submission. Plots and further details on this work are available in the slides of the CHEP 2018 talk [15] on which this paper is based. This research will be discussed in more depth in an upcoming longer paper, including more complete references and acknowledgements.