Optimising HEP parameter fits via Monte Carlo weight derivative regression

HEP event selection is traditionally considered a binary classification problem, involving the dichotomous categories of signal and background. In distribution fits for particle masses or couplings, however, signal events are not all equivalent, as the signal differential cross section has different sensitivities to the measured parameter in different regions of phase space. In this paper, I describe a mathematical framework for the evaluation and optimization of HEP parameter fits, where this sensitivity is defined on an event-by-event basis, and for MC events it is modeled in terms of their MC weight derivatives with respect to the measured parameter. Minimising the statistical error on a measurement implies the need to resolve (i.e. separate) events with different sensitivities, which ultimately represents a non-dichotomous classification problem. Since MC weight derivatives are not available for real data, the practical strategy I suggest consists in training a regressor of weight derivatives against MC events, and then using it as an optimal partitioning variable for 1-dimensional fits of data events. This CHEP2019 paper is an extension of the study presented at CHEP2018: in particular, event-by-event sensitivities allow the exact computation of the "FIP" ratio between the Fisher information obtained from an analysis and the maximum information that could possibly be obtained with an ideal detector. Using this expression, I discuss the relationship between FIP and two metrics commonly used in Meteorology (Brier score and MSE), and the importance of "sharpness" both in HEP and in that domain. I finally point out that HEP distribution fits should be optimized and evaluated using probabilistic metrics (like FIP or MSE), whereas ranking metrics (like AUC) or threshold metrics (like accuracy) are of limited relevance for these specific problems.


Introduction
The point estimation of physics parameters, such as the measurement of a cross section or of a particle's mass or couplings, is an important category of data analysis problems in experimental High Energy Physics (HEP). Optimizing these measurements ultimately consists in minimizing the combined statistical and systematic errors on the measured parameters. In this paper, I only discuss the minimization of the statistical error ∆θ, in the measurement of a single parameter θ from the binned fit of a multi-dimensional distribution of selected events. This implies the optimization of two analysis handles: event selection, i.e. the criteria for signal-background discrimination, and event partitioning, i.e. the choice of binning variables.
This article follows up on the study I presented at CHEP2018 [1]. As on that occasion, two central points of my study are a discussion of evaluation and training metrics for the data analysis tools used in the measurement, and a comparison of these metrics to those used in other scientific domains. The starting point of this analysis is, again, the calculation of the statistical error ∆θ in a binned fit for the parameter θ and its comparison to the minimum error ∆θ^(ideal) which could be achieved in an "ideal" case. Minimizing ∆θ is equivalent to maximizing (∆θ^(ideal))²/(∆θ)², a metric in [0,1] that I refer to as "Fisher Information Part" (FIP).
This research differs from and extends my CHEP2018 work in two respects. First, it shifts the focus from event selection, which is a binary classification problem, to event partitioning, and it shows that the latter can be addressed as a non-binary regression problem. The key improvement is the derivation of ∆θ in terms of the event-by-event sensitivity γ_i of each event i to the parameter θ, rather than in terms of the bin-by-bin sensitivity of bin k (which is simply the average γ̄_k of the event-by-event sensitivities in that bin). I show that the optimal partitioning strategy consists in binning events according to their sensitivity γ_i, and I use this to derive the minimum error ∆θ^(ideal) achievable with an ideal detector and an ideal analysis method. While γ_i can be computed for Monte Carlo (MC) events from the derivative of their MC weight with respect to θ, however, it is not available for real data events: the practical strategy I suggest consists in training a regressor q_i of γ_i on MC events, and using it as an optimal partitioning variable for a 1-dimensional fit of data events. The FIP metric can be used both for evaluating the quality of the result, and as a loss function for training the regressor q_i. In this context, where only statistical errors are considered, event partitioning can be seen as a generalization of event selection, which is a simpler, binary, sub-case. Rather than simply separating signal events, which are sensitive to θ, from background events, which are not, the problem to address is how to resolve, i.e. separate, events with different sensitivities to θ: this ultimately represents a non-dichotomous classification problem.
The second new contribution of this research is the comparison to other non-HEP scientific domains, beyond those I had previously considered. In my CHEP2018 study, I had mainly considered the evaluation metrics for binary classification problems in Medical Diagnostics, Information Retrieval, and Machine Learning research. I had also briefly discussed a few metrics used in those fields to go beyond a strictly dichotomous categorization of the true event categories, or to take into account the ranking of events when a scoring classifier is used instead of a binary discrete classifier. In this paper, I extend this comparative analysis by pointing out the close relationship between FIP and two metrics commonly used in Meteorology (the "Brier score" and the "Mean Squared Error" or MSE), and the importance of "sharpness" both in HEP and in that domain. More generally, I suggest that HEP distribution fits should be optimized and evaluated using probabilistic metrics (like MSE, or FIP) as is commonly the case in Meteorology and Medical Prognostics, whereas ranking metrics (like the "Area Under the ROC Curve" or AUC) or threshold metrics (like "accuracy"), which are widely used in Medical Diagnostics, are of limited relevance for these specific problems.
The outline of this paper is the following. Section 2 describes a mathematical framework for discussing statistical error minimizations in HEP parameter fits, and the use of MC weight derivative regression to optimize event partitioning. It also discusses the relationship between FIP and MSE as training metrics for Decision Tree regressors, using a decomposition of MSE into calibration and sharpness that is copied from Meteorology. Section 3 points out the relevance of probabilistic metrics, more than threshold or ranking metrics, in both HEP and Meteorology. An outlook for this research and some conclusions are given in Section 4.

Statistical errors in HEP binned fits of a parameter θ
Binned fits for a HEP parameter θ rely on splitting all selected events into K disjoint partitions, or "bins", according to the values of one or more variables that are computed as functions of the observed properties x_i of each event i. When only statistical errors are considered, the Fisher information I_θ about θ which is gained from its measurement, i.e. the inverse of the square of the statistical error ∆θ, is easily shown [1] to be the sum of the information contributions from the independent, and a fortiori uncorrelated, measurements of θ in these K bins,

I_θ = 1/(∆θ)² = ∑_{k=1}^{K} (1/n_k) (∂n_k/∂θ)²,   (1)

where n_k(θ) = s_k(θ) + b_k is the number of selected events in bin k. This is the sum of the number of signal events s_k, which depends on θ, and that of background events b_k, which does not.
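As a minimal numerical illustration of Eq. 1 (all numbers below are hypothetical toy values, not taken from any real analysis; NumPy is used for convenience), the Fisher information and the statistical error follow directly from the binned counts and their θ-derivatives:

```python
import numpy as np

# Hypothetical toy numbers: expected events per bin and their theta-derivatives
n_k = np.array([100.0, 250.0, 80.0])   # n_k(theta), selected events per bin
dn_k = np.array([10.0, -5.0, 8.0])     # dn_k/dtheta per bin

# Fisher information of Eq. 1: I_theta = sum_k (1/n_k) (dn_k/dtheta)^2
I_theta = np.sum(dn_k**2 / n_k)

# The statistical error on theta is its inverse square root
delta_theta = 1.0 / np.sqrt(I_theta)
```

Each bin contributes independently, so adding a bin can only increase I_θ and decrease ∆θ.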

MC reweighting and event-by-event sensitivities
In practice, HEP fits of a parameter θ rely on the theoretical prediction of the number of signal events s_k(θ) in bin k as a function of θ, obtained through MC simulations. A relatively standard practice to derive s_k(θ) is the MC reweighting technique, which, for instance, was used extensively by the LEP experiments in the late 1990s, for measurements of both particle masses [2,3] and particle couplings [4]. This technique is also applicable to hadron colliders [5], where it has been shown that it is generally feasible also at NLO accuracy [6]: it has been pointed out [5], in particular, that it is conceptually and practically simpler than the Matrix Element Method [7,8], which has been extensively used at hadron colliders [9][10][11], because it does not imply the time-consuming integration over undetermined momenta which is necessary in that method, and which can be performed by tools such as MadWeight [12]. Monte Carlo reweighting essentially consists in the following three steps. First, a sample of MC events for the signal process is generated at a reference value θ_ref of the parameter θ, and a weight w_i(θ_ref) is assigned to each event i; if unweighted events are generated, they all have the same w_i(θ_ref), but this is not strictly needed. Second, generator-level events are passed through full detector simulation. Third, each detector-level event i is assigned a weight w_i(θ) at another value of the parameter θ; this is done by rescaling w_i(θ_ref) by the ratio of the predicted probabilities for θ and θ_ref of event i, as described by its MC truth (generator-level) properties x_i^(true). The probability ratio is typically just a ratio of squared matrix elements,

w_i(θ) = w_i(θ_ref) |M(x_i^(true); θ)|² / |M(x_i^(true); θ_ref)|².   (2)

The above description applies to signal MC events, but each background MC event is also assigned a weight w_i, with the important difference that, by definition, it does not depend on θ.
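The three reweighting steps can be sketched as follows; the "matrix element squared" below is a hypothetical non-relativistic Breit-Wigner standing in for a real generator-level calculation, and all numbers are toy values:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical "matrix element squared" for a resonance of mass theta:
# a non-relativistic Breit-Wigner, standing in for a real generator code
def me_squared(m_true, theta, width=2.5):
    return 1.0 / ((m_true - theta) ** 2 + width**2 / 4)

# Step 1: generate unweighted events at a reference value theta_ref
theta_ref = 91.2
m_true = theta_ref + 1.25 * rng.standard_cauchy(10000)  # generator-level masses
w_ref = np.ones(m_true.size)                            # w_i(theta_ref) = 1

# Step 3: reweight each event to another theta by the probability ratio,
# computed from the generator-level (MC truth) properties
def reweight(w_ref, m_true, theta, theta_ref):
    return w_ref * me_squared(m_true, theta) / me_squared(m_true, theta_ref)

w_new = reweight(w_ref, m_true, 91.5, theta_ref)        # weights at theta = 91.5
```

Step 2 (detector simulation) does not change the weights and is omitted here; the same w_new would be attached to the corresponding detector-level events.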
Assuming that all weights w_i take into account a normalization factor to the luminosity of the data, the expected number of selected signal and background events n_k(θ) in bin k, as a function of θ, can be written as the sum of the event weights w_i for all MC events i in bin k,

n_k(θ) = ∑_{i∈k} w_i(θ).   (3)

The bin-by-bin sensitivity of n_k to θ which appears in Eq. 1 can then be written as

(1/n_k)(∂n_k/∂θ) = γ̄_k = (∑_{i∈k} w_i γ_i) / (∑_{i∈k} w_i),   (4)

i.e. as the weighted average over all MC events i in bin k, of the event-by-event sensitivity

γ_i = (1/w_i)(∂w_i/∂θ).   (5)

Note that all γ_i (and hence I_θ) depend on the value θ_I of θ where w_i and ∂w_i/∂θ are computed (typically, θ_ref). In a given binning scheme, the information I_θ of Eq. 1 can then be written as

I_θ = ∑_{k=1}^{K} n_k γ̄_k².   (6)
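These relations are easy to cross-check on a toy sample (hypothetical weights, derivatives and bin assignments; NumPy for convenience): the weighted average of the event-by-event sensitivities γ_i in each bin reproduces the bin-by-bin sensitivity (1/n_k)(∂n_k/∂θ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-event MC weights w_i(theta_I) and derivatives dw_i/dtheta
w = rng.uniform(0.5, 1.5, size=1000)
dwdtheta = rng.normal(0.0, 0.3, size=1000) * w

# Event-by-event sensitivity: gamma_i = (1/w_i) dw_i/dtheta
gamma = dwdtheta / w

# Bin-by-bin sensitivity: the w-weighted average of gamma_i in each bin
bins = rng.integers(0, 5, size=1000)   # toy bin assignment
gamma_bar = np.array(
    [np.average(gamma[bins == k], weights=w[bins == k]) for k in range(5)]
)

# Cross-check against (1/n_k) dn_k/dtheta, with n_k the sum of weights in bin k
n_k = np.array([w[bins == k].sum() for k in range(5)])
dn_k = np.array([dwdtheta[bins == k].sum() for k in range(5)])
```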

Beyond the signal-background dichotomy
For individual signal events i, the event-by-event sensitivity γ_i may be positive or negative, and the absolute value of γ_i may also be significantly different from one event to another. Background events, conversely, all have a zero event-by-event sensitivity, because these events, by definition, are produced by processes that are insensitive to the parameter θ:

γ_i = (1/w_i)(∂w_i/∂θ) = 0.   (7)

Equation 6 shows that the largest contributions to the information I_θ come from the bins with the largest average event-by-event sensitivities. As discussed in more detail later on, a good measurement is therefore one satisfying two criteria: first, the event selection accepts the events with sensitivities γ_i that are significantly different from zero, whether positive or negative, i.e. those with high absolute values of γ_i (in the following I will refer to these as events with high sensitivities, but it should be implicitly understood that I refer to their absolute values); second, the event partitioning resolves events with very different sensitivities into separate bins, as it is the average bin-by-bin sensitivity that determines the contribution to I_θ.
As an example, consider the fit of the mass M of a particle from the distribution of the invariant mass m of its decay products. The sensitivity γ i to M is positive for the signal events on the right of the mass peak (m > M) and negative for those on its left (m < M). The events with the highest sensitivity (in absolute value), in particular, are those on the steep ascending and descending slopes to the left and to the right of the mass peak. Conversely, those immediately below the mass peak or those on the tails far away from it have a sensitivity that is close to 0. These low-sensitivity signal events are not very different from background events, as the information about θ that they provide is extremely limited, and in both cases it is important to separate them from high-sensitivity signal events, so as not to dilute their sensitivity.
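This behaviour is easy to verify numerically on a toy line shape; here a non-relativistic Breit-Wigner with hypothetical values of M and Γ stands in for the true signal model:

```python
import numpy as np

# Hypothetical peak position and width (toy values, not a real measurement)
M, Gamma = 91.2, 2.5

def gamma_sens(m):
    """Event-by-event sensitivity d(ln p)/dM for p(m; M) ~ 1/((m-M)^2 + Gamma^2/4)."""
    return 2.0 * (m - M) / ((m - M) ** 2 + Gamma**2 / 4)

# gamma is zero at the peak, positive (negative) to its right (left),
# largest in absolute value on the slopes at m = M + Gamma/2 and M - Gamma/2,
# and again close to zero far out on the tails
on_peak = gamma_sens(M)
on_slope = gamma_sens(M + Gamma / 2)
far_tail = gamma_sens(M + 20 * Gamma)
```

The events contributing most information thus sit on the steep slopes, while events at the peak and on the far tails behave almost like background.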
In spite of its limitations, a dichotomous categorization of events as signal or background is still useful (especially when considering systematic errors). Using the symbols ρ_k = s_k/n_k to indicate the selection purity and φ̄_k to indicate the sensitivity of signal events alone in bin k, it is easy to see that γ̄_k = ρ_k φ̄_k: the net effect of background is to dilute the overall bin-by-bin sensitivities by a factor ρ_k ≤ 1, with respect to those computed from signal events alone. The same is also true for the bin-by-bin contributions to information, which can be written as:

n_k γ̄_k² = n_k (ρ_k φ̄_k)² = s_k ρ_k φ̄_k²,   (8)

so that the information of Eq. 6 becomes

I_θ = ∑_{k=1}^{K} s_k ρ_k φ̄_k².   (9)

For simplicity, I will assume w_i(θ_I) = 1 for all signal and background events in the following. This implies that γ̄_k = (∑_{i∈k} γ_i)/n_k and φ̄_k = (∑_{i∈k}^(sig) γ_i)/s_k in the rest of this paper.

An ideal measurement with an ideal detector, and a realistic analysis with a limited detector
In my previous paper [1], I had shown that the optimal partitioning in a fit of θ consists in separating events into bins with different values of the bin-by-bin sensitivity (1/n_k)(∂n_k/∂θ). Event-by-event sensitivities make it possible to go to a much finer granularity.
If only two selected events i1 and i2 are expected, the "information inflow" [13] gained by keeping them in separate one-event bins, rather than mixing them together in a single two-event bin, is zero if γ_i1 and γ_i2 are equal, whereas it is strictly positive if they are different. In other words, in the "ideal" case where all true values of the event-by-event sensitivities γ_i were known, the optimal way to measure θ would be a fit of the one-dimensional distribution of γ.
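This "information inflow" can be checked directly with two toy events (hypothetical sensitivities): each bin contributes n_k γ̄_k² to the information, so splitting helps exactly when the two sensitivities differ:

```python
import numpy as np

# Fisher information for a set of bins, each given as the list of the
# gamma_i of its events: each bin contributes n_k * mean(gamma)^2
def info(bin_gammas):
    return sum(len(g) * np.mean(g) ** 2 for g in map(np.asarray, bin_gammas))

g1, g2 = 0.4, -0.2                  # hypothetical event-by-event sensitivities

separate = info([[g1], [g2]])       # two one-event bins: g1^2 + g2^2
merged = info([[g1, g2]])           # one two-event bin: 2 * ((g1+g2)/2)^2

# The inflow is zero iff g1 == g2, strictly positive otherwise
inflow = separate - merged
```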

The maximum information I_θ^(ideal) that is theoretically achievable in this ideal case is simply

I_θ^(ideal) = ∑_{i=1}^{N_tot} γ_i²   (10)
            = ∑_{i=1}^{S_tot} γ_i²,   (11)

where the sum over all N_tot = S_tot + B_tot events includes S_tot signal and B_tot background events, but the contribution from the latter is 0 because they have γ_i = 0 as described in Eq. 7. As in Ref. [1], I suggest evaluating the quality of a measurement using the "Fisher Information Part", a dimensionless scalar metric in [0,1], defined as the ratio between the information which was actually achieved, in Eq. 9, and that achievable in an ideal case, in Eq. 11:

FIP = I_θ / I_θ^(ideal) = (∑_{k=1}^{K} n_k γ̄_k²) / (∑_{i=1}^{S_tot} γ_i²).   (12)

In Eq. 12, the numerator is a sum over bins, based on metrics derived from the N_sel = ∑_{k=1}^{K} n_k selected events in those bins (where N_sel = S_sel + B_sel, including S_sel signal and B_sel background events), while the denominator is a sum over the S_tot signal events in a given data sample. The main difference between this metric and the one I had previously presented [1] is that in the past I only used FIP to evaluate the quality of event selection and signal-background discrimination in a fit with a given binning, while now I redefine it to also evaluate the quality of the binning.
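The FIP can be sketched numerically on a toy sample (hypothetical Gaussian sensitivities for signal, γ_i = 0 for background, unit weights): partitioning events by a noisy observable correlated with γ_i recovers part, but not all, of the ideal information:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample: 5000 signal events with true sensitivities gamma_i,
# 2000 background events with gamma_i = 0 (Eq. 7); unit weights throughout
gamma = np.concatenate([rng.normal(0.0, 0.5, size=5000), np.zeros(2000)])

# Ideal information (Eq. 11): one-event bins, background contributes nothing
I_ideal = np.sum(gamma**2)

# Achieved information for a given partitioning: sum_k n_k * gamma_bar_k^2
def achieved_info(gamma, bins):
    return sum(
        (bins == k).sum() * gamma[bins == k].mean() ** 2 for k in np.unique(bins)
    )

# Partition events into 10 bins of a toy observable correlated with gamma
obs = gamma + rng.normal(0.0, 0.2, size=gamma.size)
bins = np.digitize(obs, np.quantile(obs, np.linspace(0.1, 0.9, 9)))

FIP = achieved_info(gamma, bins) / I_ideal   # dimensionless, in [0, 1]
```

With one event per bin the achieved information reaches the ideal value, so FIP measures how much of it a realistic partitioning retains.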
FIP is a valuable metric in my opinion because it is simple to use and interpret both qualitatively and quantitatively, in statistically-limited measurements: qualitatively, in that an analysis should be optimized to achieve the highest value of FIP; quantitatively, in that its numerical value is proportional to 1/∆θ², where ∆θ is the statistical error on the measurement. Another useful feature is that, since it is a ratio between 0 and 1, FIP can be decomposed as the product of several independent metrics which are also ratios between 0 and 1. In particular, I propose to distinguish between three effects which can result in information loss, and I decompose FIP_3 in Eq. 12 as the product of three ratios, each taking values between 0 and 1:

FIP_3 = FIP_efS × FIP_shS × FIP_shB.   (13)

The symbols FIP_efS, FIP_shS and FIP_shB denote that these ratios represent effective measures of signal efficiency and of signal and background "sharpness". FIP_efS is an information-weighted signal selection efficiency, describing the loss of information in rejecting some events: it is the ratio between the S_sel selected and S_tot total signal events, where each event is weighted by its information contribution γ_i², the square of its event-by-event sensitivity. FIP_shS measures the sharpness at resolving the S_sel = ∑_k s_k selected signal events with different sensitivities γ_i, i.e. at partitioning them into different bins of the distribution fit: it is the ratio of the information achieved in the chosen binning K, to that theoretically achievable if it were possible to partition signal events according to the true value γ_i of their sensitivity to θ. FIP_shB is an information-weighted signal selection purity, describing the loss of information due to an imperfect background rejection, in a given binning scheme K: it too measures a "sharpness", that at resolving background events (with γ_i = 0) from signal events (of any sensitivity γ_i).
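A toy numerical check of this decomposition (hypothetical sensitivities, selection and binning; the three ratios are written out as defined in the text, and their product reproduces FIP_3):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical signal sample: true sensitivities for all produced events,
# a toy selection mask, and 500 selected background events (gamma = 0)
gamma_sig_tot = rng.normal(0.0, 0.5, size=4000)
sel_sig = rng.random(4000) < 0.8
gamma_sig = gamma_sig_tot[sel_sig]
n_bkg_sel = 500

# Toy binning of selected events (background spread over the same bins)
K = 8
bins_sig = rng.integers(0, K, size=gamma_sig.size)
bkg_k = np.bincount(rng.integers(0, K, size=n_bkg_sel), minlength=K)

s_k = np.bincount(bins_sig, minlength=K).astype(float)      # signal per bin
n_k = s_k + bkg_k                                           # all events per bin
phi_k = np.array([gamma_sig[bins_sig == k].mean() for k in range(K)])
rho_k = s_k / n_k                                           # purity per bin
gbar_k = rho_k * phi_k                                      # bin sensitivity

FIP_efS = np.sum(gamma_sig**2) / np.sum(gamma_sig_tot**2)   # info-weighted eff.
FIP_shS = np.sum(s_k * phi_k**2) / np.sum(gamma_sig**2)     # signal sharpness
FIP_shB = np.sum(n_k * gbar_k**2) / np.sum(s_k * phi_k**2)  # info-weighted purity

FIP3 = FIP_efS * FIP_shS * FIP_shB
```

By construction the product telescopes to (∑_k n_k γ̄_k²)/(∑ γ_i² over all signal events), i.e. the FIP of Eq. 12.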
While I suggest the use of FIP_efS, FIP_shS and FIP_shB as figures of merit for the information-weighted efficiency and signal and background sharpness achieved by the final analysis stage of a measurement, it is important to point out that, for all three effects, the maximum achievable figure of merit with a realistic detector may be lower than 1 even if the best possible analysis method is used. Some loss of information may in fact be inevitable given the limitations of the detector, but also those of the computing and data processing chain which precedes the final analysis stage of a measurement. This is shown schematically in Fig. 1. To start with, the S_tot signal events in a final analysis sample may be fewer than the S_ALL signal events produced in beam collisions in the given data taking period, because of detector acceptance, trigger decisions and preselection cuts: this may be taken into account by another ratio FIP_ACC, analogous to FIP_efS and lower than 1, by which the analysis-level FIP_3 should be multiplied to obtain the overall FIP_ALL metric for the measurement. FIP_shS, or FIP_shB, respectively, may be lower than 1 because the limited resolution of the detector mixes together signal events with different sensitivities γ_i, or mixes together signal events and background events, respectively, making them experimentally indistinguishable. For a real detector, even the best possible analysis method can at most try to determine, at each point x of the observable phase space, the average local sensitivity of signal events φ(x) = γ̄_sig(x) = (1/s(x))(∂s(x)/∂θ) and the average local purity ρ(x) = s(x)/(s(x)+b(x)) that the detector resolution effectively establishes. In these expressions, s(x) and b(x) indicate the differential distributions of signal and background events in x-space, with ∫s(x) dx = S_tot and ∫b(x) dx = B_tot.
While the framework I propose describes the general case where signal events have different sensitivities γ_i to θ and are thus not all equivalent to one another, it also describes a much simpler case where signal events all have the same sensitivity γ_i, namely the measurement of a total signal cross section σ_s. In this case, which I discussed in Ref. [1], the only challenge is the classic binary classification problem of signal-background discrimination in the presence of strictly dichotomous true categories. As there is no need to resolve signal events from one another, FIP_shS is always 1 in this case. If σ_s is measured by a counting experiment (i.e. using a single bin), FIP_3 reduces to FIP_1 = ǫ_s·̺ [1], the product of the global signal selection efficiency FIP_efS = ǫ_s and purity FIP_shB = ̺, a metric that has been widely used in HEP already since the late 1990s [2,4,14-16]. Another common way to measure σ_s is the fit of a scoring classifier distribution: examples include fits of Neural Network or Rarity distributions at LEP [17] and fits of Boosted Decision Trees at the Tevatron [18,19] and LHC [20]. In this case, FIP_efS = 1 because all pre-selected events are included in the fit, while FIP_shB reduces to FIP_2 = (∑_k s_k ρ_k)/(∑_k s_k) [1], because γ_i is the same for all signal events.

Monte Carlo weight derivative regression
To optimize the measurement of θ from a sample of N_tot events, it would then be enough to know a single property of all events, their sensitivity γ_i to θ. The fit of the one-dimensional distribution of γ_i would provide optimal partitioning and background rejection, and achieve the minimum statistical error ∆θ^(ideal). While γ_i can be computed for MC events, however, it is not available for real data. The practical strategy I suggest is to train a regressor q_i of γ_i on MC events, i.e. a regressor of the MC weight derivatives (1/w_i)(∂w_i/∂θ) computed from the generator-level properties x_i^(true) of MC events, and use it to fit θ from the one-dimensional distribution of q(x_i) on data events, computed from their detector-level properties x_i. I refer to this approach as "Weight Derivative Regression" (WDR).
In such a crude form, this method is probably of little applicability in many practical situations, and more refined variations should be used to overcome some of its limitations. The main issue is that the MC weight derivatives (1/w i )(∂w i /∂θ) depend on the value θ I of θ at which they are computed: this dependency may be weak in fits of particle couplings, but is certainly strong in fits of particle masses. It may be necessary to compute these derivatives at more than one value of θ I , and possibly train more than one regressor, using them to measure θ from a multi-dimensional fit. A separate binary classifier for background rejection may also be useful, especially to handle systematic errors. A more detailed discussion of the limitations of this method, and practical examples of its use, are foreseen for later publications.
I stress that the method I suggest has clear similarities with, and was strongly inspired by, the "Optimal Observables" (OO) approach [21][22][23][24]. There is, however, an important difference, which schematically is the following: the WDR method consists in computing the true sensitivity γ_i of each MC event i from its generator-level properties x_i^(true), and training the regressor q_i = q(x_i) against these true γ_i, to obtain an estimate q(x) of the functional dependency of the local average sensitivity γ̄(x) = φ(x)ρ(x) on the detector-level properties x for real data events; the OO method approximately consists, instead, in analytically computing the functional dependency of γ_i on x_i^(true), and applying that same functional dependency to the observed x to obtain an estimate of γ̄(x) for real data events. As a consequence, the results that can be obtained through the OO method are significantly degraded by the effect of the experimental detector resolution, which is not properly accounted for.
The regressor q_i = q(x_i) of the sensitivity γ_i may be implemented in many different ways. Selecting a specific algorithm essentially means choosing two things: the parametrization of the q(x) function, and the metric to use for training the regressor. As in Ref. [1], I focus on Decision Tree (DT) algorithms [25], and I suggest that the maximization of FIP_3 should be used both for evaluating the measurement and for training the regressor. In a DT, the space of detector-level event properties x is split into K disjoint nodes, such that q(x) = q^(k) is a constant in each node k. Taking into account that each node of the tree may be used as a bin in the fit, the goal is to split all N_sel = N_tot events in the training sample into K nodes/bins, with n_k events in node/bin k, so as to maximize FIP_3 in Eq. 12. It is extremely interesting to see that this is equivalent to using a much more common criterion, the minimization of the Mean Squared Error (MSE). It is easy to prove, in fact, that the MSE can be decomposed as follows,

MSE = (1/N_tot) ∑_{i=1}^{N_tot} (q_i − γ_i)² = MSE_cal + MSE_sha,   (14)

where the "calibration" term MSE_cal = (1/N_tot) ∑_{k=1}^{K} n_k (q^(k) − γ̄_k)² is 0 by construction in training the DT, as q^(k) is defined as the average sensitivity γ̄_k of the MC events in node k, while the "sharpness" term MSE_sha is minimized when FIP_3 (or more precisely FIP_shS × FIP_shB, as N_sel = N_tot) is maximised, because

MSE_sha = (1/N_tot) (∑_{i=1}^{N_tot} γ_i² − ∑_{k=1}^{K} n_k γ̄_k²).   (15)

For other algorithms, such as Neural Networks, where implementing FIP maximization is not as easy as in a DT, minimizing MSE is probably still a sensible training criterion.
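This equivalence can be verified numerically on a toy sample (hypothetical sensitivities, with a simple quantile-based partition standing in for the DT nodes): with calibrated node predictions, the whole MSE reduces to the sharpness term, whose minimization maximizes the FIP numerator ∑_k n_k γ̄_k²:

```python
import numpy as np

rng = np.random.default_rng(3)

# True sensitivities gamma_i and a DT-style partition into K nodes based
# on a toy detector-level observable (all numbers hypothetical)
gamma = rng.normal(0.0, 0.5, size=3000)
obs = gamma + rng.normal(0.0, 0.3, size=3000)
K = 10
nodes = np.digitize(obs, np.quantile(obs, np.linspace(0.1, 0.9, K - 1)))

# Node predictions q^(k) = average gamma in node k, as in DT training
gbar = np.array([gamma[nodes == k].mean() for k in range(K)])
q = gbar[nodes]

N = gamma.size
mse = np.mean((q - gamma) ** 2)

# Decomposition MSE = MSE_cal + MSE_sha: the calibration term vanishes
# when the node predictions equal the node averages
n_k = np.bincount(nodes, minlength=K)
q_node = gbar                                    # calibrated node predictions
mse_cal = np.sum(n_k * (q_node - gbar) ** 2) / N  # 0 by construction here
mse_sha = (np.sum(gamma**2) - np.sum(n_k * gbar**2)) / N

# Minimizing mse_sha maximizes sum_k n_k gbar_k^2, i.e. the FIP numerator
```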

Learning from others: probabilistic metrics in Meteorology
I now take a step back to consider the more general perspective of evaluation and training metrics in different scientific domains. The reason why metrics like FIP and MSE are relevant to HEP parameter fits is that they capture the most characteristic feature of these fits: the simultaneous use of disjoint event partitions to derive a measurement of θ which is effectively a combination of the measurements performed in these individual partitions. It should be noted in passing that most of the ideas in this paper are relevant for both binned and unbinned fits, even if their applicability is more obvious in the case of binned fits. In my previous study [1], I noted that event partitioning is largely unaccounted for by the evaluation metrics commonly used in Medical Diagnostics (MD), Information Retrieval (IR) and Machine Learning (ML). Further research led me to understand two things: first, that a key point is the categorization [26][27][28][29] of performance metrics into three distinct families, namely threshold, ranking and probabilistic metrics; and, second, that MD, IR and ML mainly focus on binary classification problems described by threshold and ranking metrics, whereas HEP parameter fits require probabilistic metrics, which are widely used for regression problems in domains such as Meteorology and Climatology, or Medical Prognostics. Threshold metrics are relevant in classification problems where all events are assigned to a signal or background category by a discrete binary classifier. This includes the case when the operating point of a scoring classifier is chosen on its ROC [30][31][32][33][34][35][36][37][38] curve (for instance based on a cost matrix), a popular approach in MD [39][40][41][42][43][44][45][46]. Classifiers are evaluated from the four event counts in a two-by-two confusion matrix, namely True/False Positives/Negatives.
The simplest threshold metric is accuracy, which is widely used, but is known to have severe limitations, in both MD [47,48] and ML [49][50][51][52][53]. A popular threshold metric in IR [54][55][56][57][58][59][60] is the F1 score: this is based on precision and recall, which in HEP are known as purity ̺ and efficiency ǫ_s. In HEP, threshold metrics are especially useful in counting experiments: examples include cross section measurements by counting, where the relevant metric is FIP_1 = ǫ_s·̺, as discussed, but also searches for new physics [61][62][63][64] that are not based on distribution fits. An interesting way to compare different threshold metrics is to study their symmetries and invariances [65,66]. A fundamental feature of HEP measurements, in particular, is the irrelevance of the True Negatives count, i.e. of the number of rejected background events: in this respect, HEP is more similar to IR than it is to MD, as I briefly discussed in Ref. [1].
Ranking metrics are relevant in classification problems where all events are assigned a score D by a scoring classifier, representing their probability to belong to the signal category. Events can then be ranked by their score, which is especially important if some prioritization is needed. Ranking metrics such as precision for a fixed number of retrieved documents, or a fixed fraction of all available documents, are often used in IR [67][68][69][70]. The most commonly used ranking metric is however the Area Under the ROC Curve (AUC), which is popular in MD [71][72][73][74][75] because it represents "the probability that a randomly chosen diseased subject is correctly ranked with greater suspicion than a randomly chosen non-diseased subject". The AUC is however known to have severe limitations for both MD [76][77][78][79] and ML [80][81][82][83][84][85]. Ranking metrics are an active area of research in ML [86][87][88], which was also investigated in HEP [64]. In my opinion [1], however, ranking metrics, and in particular the AUC, are largely irrelevant in HEP measurements: while threshold metrics are needed in counting experiments, for distribution fits one should use metrics describing event partitioning, not event ranking. In a cross section fit from the distribution of a scoring classifier D, for instance, a metric like FIP_2 is relevant because it describes the fit as a combination of measurements from subsets of events with different values of D, independently of which event subset has a higher score.
A related challenge in HEP distribution fits is that signal events are not all equivalent to one another, as they have different sensitivities γ i . Research on metrics for non-dichotomous evaluation has been active on non-binary gold standards in MD [89][90][91], on graded relevance assessment in IR [92][93][94] and on cost-sensitive classification in ML [95][96][97][98][99][100], involving threshold, ranking and probabilistic metrics, and even discussing the issue of the calibration of probabilistic classifiers [101,102]. In my opinion, however, a more appropriate solution for HEP distribution fits may come from probabilistic metrics in other domains.
Probabilistic metrics are relevant in classification and regression problems where the comparison of a predicted property of an event to its true value has a probabilistic interpretation. Verification scores of forecasts in Meteorology and Climatology [103][104][105][106][107][108][109], such as MSE and the closely related Brier score, are typical probabilistic metrics. Similar metrics are also used for the evaluation of patient health predictions in Medical Prognostics [110,111]. In both cases, the quality of forecasts is assessed by comparing a forecast probability of a future weather event, or of a future disease, to the relative frequency which is eventually observed for that event. Partitioning is an essential component of this approach: for instance, ten different forecast groups may be studied, each covering a 10% probability range, with the third group including days (or patients), with a 20 to 30% probability of rain (or of survival after 5 years, respectively). A good forecast is one with two features: first, reliability or calibration, i.e. the actual fraction of rainy days must be ∼25% for forecasts in the 20-30% range; second, sharpness or resolution, i.e. it must be able to distinguish between days with a ∼25% probability and days with a ∼75% probability of rain. As discussed in Sec. 2, probabilistic metrics like MSE, and the concepts of sharpness and calibration of a regressor are also relevant to describe HEP parameter fits: the decomposition in Eq. 14 was, in fact, copied from that of the Brier score into a calibration and a sharpness term in Meteorology [104].
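The meteorological decomposition described here can be sketched for toy rain forecasts (all numbers hypothetical): grouping days by forecast probability, the Brier score splits exactly into a reliability (calibration) term, a resolution (sharpness) term, and the outcome uncertainty:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy rain forecasts grouped in probability bins, as in the ten-group
# example above (all numbers hypothetical)
p_true = rng.random(5000)                       # true rain probability per day
outcome = (rng.random(5000) < p_true).astype(float)
forecast = np.round(p_true, 1)                  # calibrated, discretized forecast

brier = np.mean((forecast - outcome) ** 2)      # Brier score = MSE of forecasts

# Murphy-style decomposition: Brier = reliability - resolution + uncertainty
obar = outcome.mean()
rel = res = 0.0
for g in np.unique(forecast):
    sel = forecast == g
    obar_g = outcome[sel].mean()                # observed frequency in group g
    rel += sel.sum() * (g - obar_g) ** 2        # reliability (calibration) term
    res += sel.sum() * (obar_g - obar) ** 2     # resolution (sharpness) term
rel /= forecast.size
res /= forecast.size
unc = obar * (1.0 - obar)                       # outcome variance (uncertainty)
```

The identity is exact because the forecast is constant within each group, mirroring the per-bin structure of the HEP decomposition in Sec. 2.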

Outlook and conclusions
I have described a mathematical framework to evaluate HEP parameter fits, and suggested a MC Weight Derivative Regression approach to optimize them. Data analysis methods are similar across scientific domains, and HEP can learn a lot from others; but different problems require different metrics, and it is important to select from other domains the tools that make sense for us. I pointed out in particular that ranking metrics like the AUC, a standard practice in Medical Diagnostics, are of limited relevance for HEP, while probabilistic metrics like the MSE and the concepts of calibration and sharpness, commonly used in Meteorology, are directly applicable in our field. I have not discussed systematic errors, or searches for new physics based on distribution fits, but I hope that this work can stimulate research in those directions. Further details on this work are available in the slides of the CHEP2019 talk [112] on which this paper is based. A more detailed article is also planned for the future.