Setting limits and application to Higgs boson search

© Owned by the authors, published by EDP Sciences, 2013

This lecture summarizes the basic concepts of hypothesis testing, introduces the concepts of significance and upper limit under the frequentist and Bayesian approaches, and discusses the benefits and limitations of the most popular approaches. Special attention is devoted to the so-called modified frequentist approach, which is a popular method in High Energy Physics, and some applications to real physics cases are discussed.


Introduction
Experiments searching for rare or unknown processes have to quantify how evident the signal they look for is. The evidence is not always sufficient to claim a discovery, and in many cases it is interesting to quote among the published results an upper limit on the expected signal yield. From such a limit, one can indirectly derive limits on the properties of the new signal that influence the signal yield, such as the mass of a new particle.
The determination of upper limits is in many cases a complex task, and the computation frequently requires numerical algorithms. Several methods to determine upper limits are adopted in High Energy Physics and are documented in the literature. The interpretation of the obtained limits can be very different, even conceptually, depending on the adopted method.
This lecture summarizes the basic concepts of hypothesis testing, introduces the concepts of significance and upper limit under the frequentist and Bayesian approaches, and discusses the benefits and limitations of the most popular approaches. Special attention is devoted to the so-called modified frequentist approach, which is a popular method in High Energy Physics, and some applications to real physics cases are discussed.

Hypothesis testing
A key task in most physics measurements is to discriminate between two or more hypotheses on the basis of the observed experimental data. One typical case is to discriminate a signal under study against background processes. This problem is addressed in statistics by hypothesis testing, which defines a procedure to assign an observation to one of two or more hypothetical models, considering their predicted probability distributions. One typical example is to determine whether a sample of events is composed of background only or contains a mixture of background plus signal events. The discrimination between the two hypotheses can be performed on a statistical basis by looking at the observed measurements of specific discriminating variables. Another typical example in physics is the identification of a particle type (e.g. as a muon vs a pion) on the basis of the measurement of a number of discriminating variables (e.g. the depth of penetration in an iron absorber or the energy release in scintillator crystals).
In the literature, two hypotheses are typically considered, called the null hypothesis, H0, and the alternative hypothesis, H1. Assume that the observed data sample consists of the measurement of a number k of variables, x = (x1, ···, xk), which are randomly distributed according to some probability density function (PDF), which is in general different for the hypotheses H0 and H1. A measurement of whether the observed data sample better agrees with H0 or rather with H1 can be given by the value of a function t(x), called test statistic, whose PDFs under the considered hypotheses can be derived from the PDFs of x. One simple example is the use of a single variable x which has discriminating power between two hypotheses, as shown in Fig. 1, in the sense that the PDFs of x under the hypotheses H1 = signal and H0 = background are appreciably different. On the basis of the observed value x̂ of the discriminating variable x, the test statistic can be defined as the measured value itself:

$$ t(\hat{x}) = \hat{x} $$

A selection requirement (in jargon usually called cut) can be defined by identifying a particle as a muon if t ≤ t_cut, or as a pion if t > t_cut, where the value t_cut is chosen by the experimenter. Not all real muons will be correctly identified as muons according to this criterion, and not all real pions will be correctly identified as pions. The expected fraction of selected signal particles (muons) is usually called signal selection efficiency, and the expected fraction of selected background particles (pions) is called misidentification probability. Misidentified particles constitute a background to positively identified signal particles. Statistical literature defines the significance level α as the probability to reject the hypothesis H1 if it is true. The case of rejecting H1 if true is called error of the first kind. In our example, this means selecting a particle as a pion in case it is a muon, hence the selection efficiency for the signal corresponds to 1 − α. The probability β to reject the hypothesis H0 if it is true (error of the second kind) is the misidentification probability, i.e. the probability to incorrectly identify a pion as a muon.
More complex examples of cut-based selections involve multiple variables, where selection requirements in multiple dimensions can be defined as regions in the space of the discriminating variables. Events are accepted as "signal" or as "background" if they fall inside or outside the selection region. Finding an optimal selection in multiple dimensions is usually not a trivial task. Two simple examples of selections with very different performances in terms of efficiency and misidentification probability are shown in Fig. 3.
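As a hedged numerical illustration of the quantities defined above, the following sketch assumes, purely for illustration, that the discriminating variable is Gaussian with unit width, centered at 0 for muons (signal) and at 3 for pions (background). These numbers and the function name `cut_performance` are hypothetical and are not taken from the lecture.

```python
from statistics import NormalDist


def cut_performance(t_cut, mu_signal=0.0, mu_background=3.0, sigma=1.0):
    """Return (efficiency, misidentification probability) for the cut t <= t_cut.

    Assumption: t is Gaussian under both hypotheses (hypothetical toy model).
    """
    signal = NormalDist(mu_signal, sigma)
    background = NormalDist(mu_background, sigma)
    efficiency = signal.cdf(t_cut)   # 1 - alpha: muons correctly selected
    misid = background.cdf(t_cut)    # beta: pions wrongly selected as muons
    return efficiency, misid


# Scan a few cut values: a looser cut raises both efficiency and misidentification.
for t_cut in (0.5, 1.5, 2.5):
    eff, beta = cut_performance(t_cut)
    print(f"t_cut = {t_cut:.1f}: efficiency = {eff:.3f}, misid = {beta:.3f}")
```

Moving the cut trades efficiency against misidentification probability, which is the trade-off that a choice of t_cut has to settle.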

The Neyman-Pearson lemma
In order to optimize the performance of a selection, one has to achieve a large selection efficiency together with a small misidentification probability. For a fixed signal efficiency, ε = 1 − α, the Neyman-Pearson lemma [1] allows one to determine the selection which has the lowest possible misidentification probability β. The selection is based on the ratio of the likelihood functions of the observed data sample x evaluated under the two hypotheses H1 and H0. The adopted test statistic is defined as:

$$ \lambda(\vec{x}) = \frac{L(\vec{x}\,|\,H_1)}{L(\vec{x}\,|\,H_0)} \tag{2} $$

The signal selection requirement based on λ is:

$$ \lambda(\vec{x}) > k_\alpha $$

where k_α is a constant which can be determined given a fixed value of α.
If the k variables x1, ···, xk that characterize our problem are independent, the likelihood function can be written as the product of one-dimensional PDFs:

$$ L(x_1, \cdots, x_k\,|\,H_{0,1}) = \prod_{j=1}^{k} f_j(x_j\,|\,H_{0,1}) $$

This allows in many cases to simplify the computation of the likelihood ratio and to easily obtain the optimal selection. In concrete examples it is not always easy to find the exact functional form of λ.
Numerical methods and algorithms exist to find selections in the variable space whose performance, in terms of efficiency and misidentification probability, is close to the optimal limit given by the Neyman-Pearson lemma. In some cases those algorithms reach great complexity. Among such methods, some of the most frequently used in High Energy Physics are Artificial Neural Networks and Boosted Decision Trees, which are treated in this series of lectures.
In case we have a sample consisting of n events, each determined from the observation of the k variables x1, ···, xk, the likelihood function corresponding to the entire sample can be written as the product of the PDFs evaluated at the observed variables x_i, i = 1, ···, n, for each event:

$$ L(\vec{x}_1, \cdots, \vec{x}_n; \vec{\theta}) = \prod_{i=1}^{n} f(\vec{x}_i; \vec{\theta}) $$

Above, the hypotheses H1 and H0 are represented as two possible sets of values of the parameters θ = (θ1, ···, θm) that characterize the PDFs. Usually we want to use the number of events n as information in the likelihood definition, hence we use the extended likelihood function, defined as the product of the usual likelihood function and a Poissonian probability corresponding to the observed number of events n:

$$ L(\vec{x}_1, \cdots, \vec{x}_n; \vec{\theta}) = \frac{e^{-\nu}\,\nu^{n}}{n!} \prod_{i=1}^{n} f(\vec{x}_i; \vec{\theta}) \tag{6} $$

In the Poissonian term the expected number of events ν may also depend on the parameters θ: ν = ν(θ). Typically, we want to discriminate between two hypotheses, which are the presence of only background events in our sample (ν = b) or the presence of both signal and background events (ν = s + b). The signal strength μ is usually introduced to measure the ratio of the signal yield to its theoretical prediction s:

$$ \nu = \mu s + b $$

IN2P3 School Of Statistics, Autrans
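The hypothesis H0 corresponding to the presence of background only is equivalent to μ = 0, while the hypothesis H1 corresponding to the presence of background plus signal is equivalent to μ = 1. The PDF f(x_i; θ) can be written as the superposition of two components, one PDF for signal and another for background, weighted by the expected signal and background fractions, respectively:

$$ f(\vec{x}; \vec{\theta}) = \frac{\mu s}{\mu s + b}\, f_s(\vec{x}; \vec{\theta}) + \frac{b}{\mu s + b}\, f_b(\vec{x}; \vec{\theta}) \tag{8} $$

In this case the extended likelihood function, Eq. 6, becomes:

$$ L_{s+b}(\vec{x}_1, \cdots, \vec{x}_n; \mu, \vec{\theta}) = \frac{e^{-(\mu s + b)}}{n!} \prod_{i=1}^{n} \left( \mu s\, f_s(\vec{x}_i; \vec{\theta}) + b\, f_b(\vec{x}_i; \vec{\theta}) \right) $$

The term 1/n! disappears when taking the likelihood ratio in Eq. (2).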

Wilks' theorem
In the case of a large number of events, it is useful to have an approximation of the likelihood ratio defined in Eq. 2. Wilks' theorem [2] states that, assuming some regularity conditions of the likelihood function, the quantity:

$$ \chi^2_r = -2 \ln \frac{L(\vec{x}\,|\,H_0(\hat{\vec{\theta}}_0))}{L(\vec{x}\,|\,H_1(\hat{\vec{\theta}}_1))} $$

where the parameter values θ̂0 and θ̂1 are taken as the maximum likelihood estimates of θ corresponding to the observed data sample x in the two hypotheses H0 and H1 respectively, is asymptotically distributed as a χ² with a number of degrees of freedom equal to the difference between the number of free parameters (i.e. parameters determined by the fit rather than fixed) in L(x|H1) and L(x|H0) [3]. In terms of Eq. 2, this quantity is equal to 2 ln λ evaluated at the maximum likelihood estimates.

Claiming a discovery: significance
Given an observed data sample, claiming the discovery of a new signal requires determining that the sample is sufficiently inconsistent with the hypothesis that only background is present. A test statistic can be used to measure how consistent or inconsistent the observation is with the hypothesis μ = 0.
A quantitative measurement of the inconsistency with the background-only hypothesis is given by the significance, defined from the probability p (p-value) that the considered test statistic t assumes a value greater than or equal to the observed one (large values of t correspond to a more signal-like sample) in the case of a pure background fluctuation. The p-value has a uniform distribution between 0 and 1 for the background-only hypothesis, and tends to have small values in the presence of a signal. The better the separation between signal and background, the more the distribution is peaked towards zero in the presence of a signal.
Instead of quoting the p-value, publications often prefer to quote the equivalent number of standard deviations that corresponds to an area p under the upper tail of a normal distribution. So, one quotes a "Zσ" significance corresponding to a given p-value by using the following transformation:

$$ p = \int_{Z}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = 1 - \Phi(Z) $$

where Φ is the cumulative distribution of a standard normal variable.
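The conversion between a p-value and a one-sided "Zσ" significance can be sketched with the Python standard library alone. The function names `z_to_p` and `p_to_z` are illustrative, and the one-sided Gaussian tail convention is the one described in the text.

```python
import math
from statistics import NormalDist


def z_to_p(z):
    """One-sided p-value for a significance Z: area of the normal tail above Z.

    erfc is used instead of 1 - cdf(Z) to avoid cancellation in the far tail.
    """
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def p_to_z(p):
    """Significance Z corresponding to a one-sided p-value (0 < p < 1)."""
    return -NormalDist().inv_cdf(p)


for z in (1, 2, 3, 4, 5):
    print(f"Z = {z}: p = {z_to_p(z):.3g}")
```

With these definitions, Z = 3 and Z = 5 correspond to p-values of about 1.35 × 10⁻³ and 2.87 × 10⁻⁷, respectively.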

EPJ Web of Conferences
By convention in the literature, one claims the "evidence of" the signal under investigation if the observed significance is at least 3σ (Z = 3), which corresponds to a probability of background fluctuation of 1.35 × 10⁻³, while one claims the "observation" of the signal (discovery!) in case the significance is at least 5σ (Z = 5), corresponding to a p-value of 2.87 × 10⁻⁷. Table 1 shows a number of typical significance values expressed as "Zσ" and their corresponding p-values. Determining the significance, anyway, is only part of the process that leads to a discovery in the scientific method. Quoting from Ref. [4]: "It should be emphasized that in an actual scientific context, rejecting the background-only hypothesis in a statistical sense is only part of discovering a new phenomenon. One's degree of belief that a new process is present will depend in general on other factors as well, such as the plausibility of the new signal hypothesis and the degree to which it can describe the data. Here, however, we only consider the task of determining the p-value of the background-only hypothesis; if it is found below a specified threshold, we regard this as "discovery"." Evaluating the "plausibility of a new signal" and the other factors that give confidence in a discovery requires a judgement that cannot, of course, be replaced by the statistical evaluation alone.

Excluding a signal hypothesis
For the purpose of excluding a signal hypothesis, the requirement applied in terms of p-value is usually much milder than for a discovery. Instead of requiring a p-value of 2.87 × 10⁻⁷ or less (5σ), the upper limits for an exclusion are set requiring p < 0.05, corresponding to a 95% confidence level (CL), or p < 0.10, corresponding to a 90% CL. In this case, p indicates the probability of a signal underfluctuation, i.e. the null hypothesis and the alternative hypothesis are inverted with respect to the case of a discovery.

Definitions of upper limits
In the frequentist approach, the procedure to set an upper limit is similar to the determination of a confidence interval for the unknown signal yield s. In the case one wants to determine an upper limit instead of a central interval, the interval with the desired CL (usually 90% or 95%) is chosen to be fully asymmetric, becoming s ∈ [0, s_up[. When the outcome of an experiment is an upper limit, one usually quotes: s < s_up at 95% CL (or 90% CL).
If the Bayesian approach is adopted, the interpretation of an upper limit s_up is very different. The interval s ∈ [0, s_up[ has to be interpreted as a credible interval, meaning that its corresponding posterior probability is equal to the CL, 1 − α.

Poissonian counting experiments
A simple though realistic case is a counting experiment where the selected events contain a mixture of signal and background events. The total number of observed events will be on average s + b, where s and b are the expected numbers of signal and background events, respectively. The main unknown parameter is s, which could also be equal to zero in case the signal is not present. The likelihood function in the case of a counting experiment is:

$$ L(n; s, b) = \frac{(s + b)^{n}}{n!}\, e^{-(s + b)} $$

where n is the observed number of events.

Bayesian approach
The easiest treatment of a counting experiment, at least from the technical point of view, can be done under the Bayesian approach. Assuming a uniform prior PDF for s, the Bayesian posterior PDF for s is given by:

$$ P(s\,|\,n) = \frac{(s + b)^{n}\, e^{-(s + b)}}{\int_0^{\infty} (s' + b)^{n}\, e^{-(s' + b)}\, ds'} $$

The upper limit s_up can be computed requiring that the posterior probability corresponding to the interval [0, s_up[ is equal to CL, or equivalently that the probability corresponding to [s_up, ∞[ is equal to α = 1 − CL:

$$ \alpha = \int_{s_{up}}^{\infty} P(s\,|\,n)\, ds \tag{14} $$

In the simplest case of negligible background, b = 0, the posterior PDF for s can be demonstrated to have the same expression as the Poissonian probability itself:

$$ P(s\,|\,n) = \frac{s^{n}\, e^{-s}}{n!} \tag{15} $$

In the case of no observed events, n = 0, we have:

$$ P(s\,|\,0) = e^{-s} $$

and:

$$ \alpha = \int_{s_{up}}^{\infty} e^{-s}\, ds = e^{-s_{up}} $$

Hence, s_up = −ln α and we can set the following upper limits:

s_up = 2.30 at 90% CL,
s_up = 3.00 at 95% CL.

The general case of a possible expected background b ≠ 0 was treated by O. Helene [5], and Eq. 14 becomes:

$$ \alpha = e^{-s_{up}}\, \frac{\sum_{m=0}^{n} (s_{up} + b)^{m} / m!}{\sum_{m=0}^{n} b^{m} / m!} \tag{20} $$

The above expression can be inverted numerically to determine s_up for given α, n and b. In the case of no background (b = 0), Eq. (20) becomes:

$$ \alpha = e^{-s_{up}} \sum_{m=0}^{n} \frac{s_{up}^{m}}{m!} $$

and the corresponding upper limits are reported in Tab. 2. For different numbers of observed events n and different expected backgrounds b, the upper limits derived in [5] are shown in Fig. 4.
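The numerical inversion mentioned above can be sketched with a simple bisection. The sketch assumes Helene's expression in its standard form, α = e^(−s_up) · Σ_{m≤n} (s_up + b)^m/m! / Σ_{m≤n} b^m/m!, with a uniform prior on s. The function names are illustrative and not taken from Ref. [5].

```python
import math


def tail_probability(s, n, b):
    """Posterior probability that the signal yield exceeds s (uniform prior),
    given n observed events and an expected background b."""
    num = sum((s + b) ** m / math.factorial(m) for m in range(n + 1))
    den = sum(b ** m / math.factorial(m) for m in range(n + 1))
    return math.exp(-s) * num / den


def bayesian_upper_limit(n, b=0.0, cl=0.90, tol=1e-9):
    """Solve tail_probability(s_up) = 1 - cl for s_up by bisection."""
    alpha = 1.0 - cl
    lo, hi = 0.0, 1.0
    while tail_probability(hi, n, b) > alpha:   # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tail_probability(mid, n, b) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for n in range(4):
    print(n, round(bayesian_upper_limit(n, cl=0.90), 2),
          round(bayesian_upper_limit(n, cl=0.95), 2))
```

For n = 0 the ratio of sums reduces to 1 for any b, so the limit is s_up = −ln α regardless of the expected background.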

Limitations of the Bayesian approach
The derivation of Bayesian upper limits done above assumes a uniform prior on the expected signal yield s. Assuming a different prior distribution would result in different upper limits. In general, there is no unique criterion to choose a specific prior PDF to model the complete lack of knowledge about a variable, like in this case the signal yield. This subjectiveness in the choice of the prior PDF is intrinsic in the Bayesian approach, and raises criticism by supporters of the frequentist approach, who object that the obtained Bayesian results are to some extent subjective. Supporters of the Bayesian approach reply that the obtained results are intersubjective [6], in the sense that common prior choices lead to common results, and some debates are still ongoing in the literature.
A frequently adopted prior distribution in physics that models one's complete lack of knowledge of a parameter is a uniform distribution, as was done for the simple Poissonian example above. This approach is anyway not unique: should we define a prior uniform in s or in ln s? A typical case is the measurement of a particle lifetime τ or, alternatively, its width Γ ∼ 1/τ. Since there is no natural choice between the two quantities, should we assume a uniform prior in τ or in 1/τ? An approach to find a prior distribution that is invariant under reparametrization of our PDF is due to H. Jeffreys [7], who suggested to choose the prior to be proportional to the square root of the determinant of the Fisher information matrix:

$$ \pi(\vec{\theta}) \propto \sqrt{\det I(\vec{\theta})} $$

where

$$ I_{ij}(\vec{\theta}) = \left\langle \frac{\partial \ln L(\vec{x}\,|\,\vec{\theta})}{\partial \theta_i}\, \frac{\partial \ln L(\vec{x}\,|\,\vec{\theta})}{\partial \theta_j} \right\rangle $$

Using Jeffreys' approach leads to prior PDFs that are usually not uniform. Table 3 shows a number of typical cases. For instance, for a Poissonian counting experiment Jeffreys' prior is proportional to 1/√s, not uniform as assumed to determine Eq. 15.
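The statement that Jeffreys' prior for a Poissonian counting experiment is proportional to 1/√s can be checked numerically. The sketch below evaluates the Fisher information as the expectation value of the squared derivative of ln P(n; s) with respect to s, by direct summation over n. The function name and the truncation `n_max` are illustrative assumptions.

```python
import math


def poisson_fisher_information(s, n_max=200):
    """Fisher information I(s) = E[(d ln P(n; s) / ds)^2] for a Poisson mean s,
    evaluated by direct summation over n (truncated at n_max, adequate for s << n_max)."""
    info = 0.0
    log_p = -s                                   # ln P(0; s)
    for n in range(n_max + 1):
        if n > 0:
            log_p += math.log(s) - math.log(n)   # ln P(n; s) from ln P(n - 1; s)
        score = n / s - 1.0                      # d ln P(n; s) / ds
        info += math.exp(log_p) * score ** 2
    return info


for s in (0.5, 2.0, 10.0):
    jeffreys = math.sqrt(poisson_fisher_information(s))
    print(f"s = {s}: sqrt(I) = {jeffreys:.4f}, 1/sqrt(s) = {1 / math.sqrt(s):.4f}")
```

The sum reproduces I(s) = 1/s, so the square root of the Fisher information is 1/√s.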

Frequentist limits: a simple case
In case we observe n = 0 events, we can state that the number of observed signal events is n_s = 0, and the number of observed background events is n_b = 0. If we also assume for simplicity that the expected background is negligible, we can set b ≈ 0, hence we will have, for any observed number of events n, that n_b = 0 and n_s = n. The probability to observe n events when we expect s events is given by the Poissonian distribution:

$$ P(n; s) = \frac{e^{-s}\, s^{n}}{n!} $$

For n = 0 we have:

$$ p = P(0; s) = e^{-s} $$

We can set an upper limit on the expected signal yield s by excluding the values of s for which p < α = 1 − CL. Hence, we allow signal values s that satisfy:

$$ p = e^{-s} \ge \alpha = 1 - \mathrm{CL} $$

The above relation can be inverted, and gives:

$$ s \le -\ln \alpha = s_{up} $$

i.e. s < 2.30 at 90% CL and s < 3.00 at 95% CL. Those results coincide accidentally with the results obtained under the Bayesian approach and shown in Table 2. The coincidence of limits under the Bayesian and frequentist approaches, like in this case, may lead to confusion. There is no intrinsic reason for which limits evaluated under the two approaches should coincide, and in general, with very few exceptions like this one, Bayesian and frequentist limits do not coincide.

Steps towards a frequentist approach
An effort to reconcile the Bayesian limits obtained by Helene in Ref. [5] with the frequentist approach was attempted by G. Zech [8]. In order to determine the probability distribution of the number of events from the sum of two Poissonian processes with s and b expected numbers of events from signal and background, respectively, one can write the probability distribution for the total observed number of events n as:

$$ P(n; s, b) = \sum_{n_b = 0}^{n} P(n_b; b)\, P(n - n_b; s) $$

Zech proposed to modify the background term P(n_b; b) to take into account that, once n events have been observed, the number of background events n_b cannot exceed n, replacing it with the renormalized probability:

$$ P'(n_b; b) = \frac{P(n_b; b)}{\sum_{n_b' = 0}^{n} P(n_b'; b)} $$

This modification leads to the same result obtained by Helene in Eq. (20). The approach was later criticized [9] because it led to incorrect coverage, and Zech himself admitted the non-rigorous application of the frequentist approach. Nonetheless, his intuition anticipates the formulation of the modified frequentist approach that will be discussed later on in Section 14.

Frequentist approach: Neyman's confidence intervals
A rigorous and general frequentist treatment of confidence intervals is due to J. Neyman [11]. Let's consider a variable x distributed according to a PDF which depends on an unknown parameter θ.
Neyman's procedure to determine confidence intervals proceeds in two steps. First, a confidence belt is determined by scanning the parameter space, varying θ within its allowed range. For each fixed value θ = θ0 we know the corresponding PDF which describes the distribution of x, f(x|θ0). According to the PDF f(x|θ0), an interval [x1(θ0), x2(θ0)] is determined whose corresponding probability is equal to the specified CL = 1 − α, usually equal to 68.27%, 90% or 95%:

$$ 1 - \alpha = \int_{x_1(\theta_0)}^{x_2(\theta_0)} f(x\,|\,\theta_0)\, dx \tag{33} $$

Neyman's construction is graphically illustrated in Fig. 5, left. The choice of x1(θ0) and x2(θ0) still has some arbitrariness, since there are different possible intervals having the same probability, according to Eq. (33). The choice of this interval is referred to in the literature as ordering rule. For instance, one can choose an interval centered around the average value of x given θ0, i.e. an interval:

$$ [x_1(\theta_0), x_2(\theta_0)] = [\langle x \rangle_{\theta_0} - \delta,\; \langle x \rangle_{\theta_0} + \delta] $$

where δ is such as to ensure that Eq. (33) holds. Or one can choose the interval such that

$$ \int_{-\infty}^{x_1(\theta_0)} f(x\,|\,\theta_0)\, dx = \frac{\alpha}{2} = \int_{x_2(\theta_0)}^{+\infty} f(x\,|\,\theta_0)\, dx $$

One can also choose one of the two possible fully asymmetric intervals:

$$ ]-\infty,\; x_2(\theta_0)] \quad \text{or} \quad [x_1(\theta_0),\; +\infty[ $$

A further ordering rule, based on a likelihood ratio, will be discussed in Section 12. Given a choice of the ordering rule, the intervals [x1(θ), x2(θ)], for all possible values of θ, define the Neyman belt in the space (x, θ), as shown in Fig. 5.
As second step of the Neyman procedure, given a measurement x = x0, the confidence interval for θ is determined by inverting the Neyman belt (Fig. 5, right): two extreme values θ1(x0) and θ2(x0) are determined as the intersections of the vertical line x = x0 with the two boundary curves of the belt. The interval [θ1(x0), θ2(x0)] has, by construction, a coverage equal to the confidence level 1 − α. This means that, if θ is equal to a true value θ0 and x = x0 is extracted randomly according to the PDF f(x|θ0), θ0 will be included in the determined confidence interval [θ1(x0), θ2(x0)] in a fraction 1 − α of the cases, in the limit of a large number of extractions.
Upper or lower limits on θ can be determined using fully asymmetric intervals for x. In particular, assuming that the Neyman belt is monotonically increasing, the choice of intervals [x1(θ0), +∞[ leads to a confidence interval [0, θ(x0)] for θ, which corresponds to an upper limit θ_up = θ(x0). This case is illustrated in Fig. 7.

The "flip-flopping" problem
In order to determine confidence intervals, a consistent choice of ordering rule has to be adopted. Feldman and Cousins demonstrated [12] that the choice of the ordering rule must not depend on the outcome of the measurement, otherwise the quoted confidence intervals or upper limits could correspond to an incorrect confidence level (i.e. coverage). In some cases, experiments searching for a rare signal choose, when quoting their result, to switch from a central interval to an upper limit depending on the outcome of the measurement. A typical choice is to quote an upper limit if the significance of the observed signal is smaller than 3σ, and a central value otherwise. This problem is sometimes referred to in the literature as flip-flopping, and can be illustrated with a simple example. Imagine a model where a random variable x obeys a Gaussian distribution with a fixed and known r.m.s., which for simplicity we can take as σ = 1, and an unknown average μ which is bound to be greater than or equal to zero (this is the case of a signal yield). The quoted central value must always be greater than or equal to zero, given the assumed constraint. Assume we decide to quote zero if the significance is less than 3σ. From a single measurement of x we could decide to quote a central value if x/σ ≥ 3 with a symmetric error: ±σ at the 68.27% CL, or ±1.645σ at 90% CL. Instead, we may decide to quote an upper limit if x/σ < 3. The upper limit on μ can be derived using a fully asymmetric interval, and corresponds to μ < x + 1.282 at 90% CL. The quoted confidence interval at 90% CL becomes:

$$ [\mu_1, \mu_2] = \begin{cases} [0,\; x + 1.282] & \text{if } x < 3 \\ [x - 1.645,\; x + 1.645] & \text{if } x \ge 3 \end{cases} $$

The situation is shown in Fig. 8. The choice to switch from a central interval to a fully asymmetric interval (upper limit) based on the observation of x clearly spoils the statistical coverage. Looking at Fig. 8, and considering the interval [x1, x2] obtained by crossing the confidence belt with a horizontal line, one finds that for some values of μ the coverage decreases from 90% to 85%, which is lower than the desired CL. Section 12 presents the method due to Feldman and Cousins to treat the coverage consistently without incurring the flip-flopping problem.

Figure 8: The quoted central value of μ as a function of the measured x (dashed line), and the 90% confidence interval corresponding to the choice to quote a central interval for x/σ ≥ 3 and an upper limit for x/σ < 3.
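The loss of coverage can be checked directly for the policy described above: quote the central interval x ± 1.645 if x ≥ 3 and the upper limit μ < x + 1.282 if x < 3, with σ = 1. The following sketch computes, for a given true μ, the probability that the quoted interval contains μ. The function name `flip_flop_coverage` is illustrative.

```python
from statistics import NormalDist

PHI = NormalDist().cdf   # cumulative distribution of a standard normal variable


def flip_flop_coverage(mu, switch=3.0, z_central=1.645, z_upper=1.282):
    """Probability that the quoted 90% CL interval contains the true mu >= 0
    when x is Gaussian with mean mu and unit width."""

    def prob(lo, hi):
        # P(lo <= x <= hi) for x ~ N(mu, 1); zero if the range is empty
        return max(0.0, PHI(hi - mu) - PHI(lo - mu))

    # Upper-limit branch (x < switch): mu is covered if mu <= x + z_upper
    p_upper = prob(mu - z_upper, switch)
    # Central branch (x >= switch): mu is covered if |x - mu| <= z_central
    p_central = prob(max(switch, mu - z_central), mu + z_central)
    return p_upper + p_central


for mu in (0.5, 2.5, 4.0, 6.0):
    print(f"mu = {mu}: coverage = {flip_flop_coverage(mu):.3f}")
```

For intermediate values such as μ = 2.5 the coverage drops to about 85%, while for large μ the nominal 90% is recovered.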

The unified Feldman-Cousins approach
The ordering rule proposed by Feldman and Cousins [12] provides a Neyman confidence belt, as defined in Section 10, that smoothly changes from a central or quasi-central interval to an upper limit in the case of low observed signal yield. The ordering rule is based on the likelihood ratio introduced in Section 2.1: given a value θ0 of the unknown parameter under a Neyman construction, the chosen interval on the variable x is defined from the ratio of two PDFs of x, one under the hypothesis that θ is equal to the considered fixed value θ0, the other under the hypothesis that θ is equal to the maximum-likelihood estimate θ_best(x) corresponding to the given measurement x. The likelihood ratio must be greater than a constant k_α whose value depends on the chosen confidence level 1 − α:

$$ \lambda(x\,|\,\theta_0) = \frac{f(x\,|\,\theta_0)}{f(x\,|\,\theta_{best}(x))} > k_\alpha \tag{38} $$

The confidence interval R_α for a given value θ0 is given by:

$$ R_\alpha(\theta_0) = \{ x : \lambda(x\,|\,\theta_0) > k_\alpha \} $$

and the constant k_α is chosen in such a way that:

$$ \int_{R_\alpha(\theta_0)} f(x\,|\,\theta_0)\, dx = 1 - \alpha \tag{40} $$

This case is illustrated in Fig. 9. Feldman and Cousins computed the confidence interval for the simple Gaussian case discussed in Section 11. The maximum-likelihood value for μ, given x and under the constraint μ ≥ 0, is:

$$ \mu_{best}(x) = \max(x, 0) $$

Figure 9: Ordering rule in the Feldman-Cousins approach, based on the likelihood ratio.

The PDF for x using the maximum-likelihood estimate for μ becomes:

$$ f(x\,|\,\mu_{best}(x)) = \begin{cases} \dfrac{1}{\sqrt{2\pi}} & \text{if } x \ge 0 \\[2ex] \dfrac{1}{\sqrt{2\pi}}\, e^{-x^2/2} & \text{if } x < 0 \end{cases} $$

The likelihood ratio in Eq. 38 can be written for this case as:

$$ \lambda(x\,|\,\mu) = \frac{f(x\,|\,\mu)}{f(x\,|\,\mu_{best}(x))} = \begin{cases} \exp\left(-(x - \mu)^2/2\right) & \text{if } x \ge 0 \\ \exp\left(x\mu - \mu^2/2\right) & \text{if } x < 0 \end{cases} $$

The interval [x1(μ0), x2(μ0)], for a given μ = μ0, can be found numerically using the condition λ(x|μ0) > k_α and imposing the normalization from Eq. 40, given the desired value of α. The results are shown in Fig. 10, and can be compared to Fig. 8. Using the Feldman-Cousins (FC) approach, for large values of x one gets the usual symmetric confidence interval. As x moves to lower values, the interval becomes more and more asymmetric, and at some point it becomes fully asymmetric, determining an upper limit. For negative values of x the result is always an upper limit, which avoids quoting unphysical negative values of μ. This approach smoothly changes from a central interval to an upper limit, while ensuring the correct 90% coverage.
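A minimal sketch of the numerical step described above, for the unit-width Gaussian with μ ≥ 0: it assumes the likelihood ratio λ(x|μ0) = exp(−(x − μ0)²/2) for x ≥ 0 and exp(xμ0 − μ0²/2) for x < 0, and tunes the threshold k by bisection until the accepted region in x has probability equal to the CL. The function name is illustrative and the sketch is restricted to μ0 > 0.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf


def fc_acceptance_interval(mu0, cl=0.90):
    """Acceptance interval [x1, x2] in x for a fixed mu0 > 0, using the
    likelihood-ratio ordering lambda(x | mu0) > k (unit-width Gaussian, mu >= 0)."""
    if mu0 <= 0.0:
        raise ValueError("this sketch assumes mu0 > 0")

    def interval(log_k):
        d = math.sqrt(-2.0 * log_k)
        x2 = mu0 + d                  # upper edge is always on the x >= 0 branch
        x1 = mu0 - d
        if x1 < 0.0:                  # lower edge falls on the x < 0 branch
            x1 = (log_k + 0.5 * mu0 ** 2) / mu0
        return x1, x2

    def content(log_k):
        x1, x2 = interval(log_k)
        return PHI(x2 - mu0) - PHI(x1 - mu0)

    lo, hi = -50.0, 0.0               # content(lo) is ~1, content(hi) is 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if content(mid) > cl:
            lo = mid                  # region too wide: raise the threshold
        else:
            hi = mid
    return interval(0.5 * (lo + hi))


for mu0 in (0.5, 1.0, 2.0, 5.0):
    x1, x2 = fc_acceptance_interval(mu0)
    print(f"mu0 = {mu0}: x in [{x1:.3f}, {x2:.3f}]")
```

Scanning μ0 and collecting these intervals gives the confidence belt. Inverting the belt at the measured x then gives the interval on μ. For large μ0 the interval reduces to the symmetric μ0 ± 1.645.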

Frequentist upper limits on discrete variables
In the case of a discrete variable n, like for a Poissonian counting experiment, it is not always possible to find an interval {n1, ..., nk} that has the exact coverage. In such cases, one has to take the smallest interval having a probability greater than or equal to the desired CL. In this way the determined limit is conservative, i.e. the procedure ensures that the probability that the true value s lies within the determined confidence interval [s1, s2] is greater than or equal to CL = 1 − α (overcoverage).
In the simplest case where n = 0, we have:

$$ P(0; s_{up}) = e^{-s_{up}} = \alpha $$

Since α is equal to one minus the CL, we have:

$$ s_{up} = -\ln \alpha = -\ln(1 - \mathrm{CL}) $$

For values of α = 0.05 (95% CL) or α = 0.1 (90% CL), we have again the results derived in Section 8:

s < 3.00 at 95% CL,
s < 2.30 at 90% CL.

From the purely frequentist point of view, anyway, this result suffers from the flip-flopping problem discussed in Section 11: the choice to switch from a fully asymmetric to a central interval according to the observed result leads to an incorrect coverage, which can be fixed by adopting the FC approach. In the Poissonian case, the 90% confidence belt obtained with the FC approach is shown in Fig. 12. The results in the case of no background (b = 0) are reported in Table 4. For different numbers of observed events n and different expected backgrounds b, the upper limits derived using the FC method are shown in Fig. 13. Comparing Table 4 with Table 2, which contains the Bayesian results, FC upper limits are in general larger, i.e. less stringent, than Bayesian limits. In particular, for the case of n = 0, the upper limit increases from 2.30 to 2.44 for a 90% CL and from 3.00 to 3.09 for a 95% CL. But, as remarked before, the interpretation of frequentist and Bayesian limits is very different.

A peculiar feature of FC upper limits is that, for n = 0, a larger expected background b corresponds to a more stringent, i.e. lower, upper limit, differently from what happens with Bayesian limits, which do not depend on the expected background for n = 0. This dependence on the expected amount of background is somewhat counterintuitive: imagine two experiments performing a search for a rare signal, both designed to achieve a low background level. If both measure zero counts, the experiment that achieves the most stringent limit is the one which has the highest expected background level!
The Particle Data Group published in their review [13] the following sentence about the interpretation of frequentist upper limits, in particular concerning the difficulty to interpret a more stringent limit if the expected background increases for the n = 0 case: "The intervals constructed according to the unified [Feldman Cousins] procedure for a Poisson variable n consisting of signal and background have the property that for n = 0 observed events, the upper limit decreases for increasing expected background.This is counter-intuitive, since it is known that if n = 0 for the experiment in question, then no background was observed, and therefore one may argue that the expected background should not be relevant.The extent to which one should regard this feature as a drawback is a subject of some controversy".
This feature of frequentist limits is often considered unpleasant by physicists. The need to come to an agreed procedure to determine upper limits, mainly triggered by the need to combine the results of the four LEP experiments on the Higgs boson search, led to the proposal of a new method that modifies the purely frequentist approach, as will be discussed in the following section.

Modified frequentist approach: the CL s method
The concerns about frequentist limits discussed at the end of the previous section have been addressed in the definition of a new procedure that was adopted for the first time in the combination of the results of the search for the Higgs boson [14] of the four LEP experiments, ALEPH, DELPHI, OPAL and L3. The modification of the purely frequentist confidence level by a conservative correction factor can cure, as will be presented in the following, the counterintuitive peculiarities of the frequentist limit procedure.
The original proposal of the modified frequentist approach adopted a test statistic based on the ratio of the likelihood functions evaluated under two different hypotheses: the presence of signal plus background, and the presence of background only:

$$ \lambda(\vec{\theta}) = \frac{L_{s+b}(\vec{x}\,|\,\vec{\theta})}{L_{b}(\vec{x}\,|\,\vec{\theta})} $$

Different test statistics have been applied after the original definition of the LEP procedure, but the remaining part of the method described in the following has been adopted mainly unchanged with the different kinds of test statistics. In the case of a simple event counting, assuming that the expected signal and background yields depend on the unknown parameters θ = (θ1, ···, θm), the likelihood function only depends on the number of observed events n, and the likelihood ratio is:

$$ \lambda(\vec{\theta}) = \frac{L_{s+b}(n\,|\,\vec{\theta})}{L_{b}(n\,|\,\vec{\theta})} $$

where L_{s+b} and L_b are Poissonian probabilities whose expected averages are s + b and b respectively, and the signal and background yields s and b depend on θ. More explicitly, we can write:

$$ \lambda(\vec{\theta}) = \frac{e^{-(s + b)}\, (s + b)^{n} / n!}{e^{-b}\, b^{n} / n!} = e^{-s} \left( 1 + \frac{s}{b} \right)^{n} \tag{51} $$

Moving to the negative logarithm the above expression becomes: If we consider, in addition to the pure counting information n, a set of k measured variables x = (x 1 , • • • , x k ) that characterize each event, can write the ratio of extended likelihood functions as: where P(n|s + b) and P(n|b) are Poissonian probabilities as in Eq 51, and f s+b and f b are the PDF for signal plus background and background only respectively of the variables x.Explicitating the Poissonian terms and writing f s+b as as the superposition of signal and background compoments, similarly to Eq. 8, we have: where f s is the PDF of signal event.With a bit of math, we can rewrite Eq. 54 as: Moving to the negative logarithm, we have: In the case of a single parameter θ (i.e.: m = 1), one can plot − ln λ(θ) as a function of θ, and the presence of a significant minimum at θ = θ is an indication of the possible presence of a signal having a value of the parameter θ near θ within some uncertainty.If the background PDF does not depend on θ (for instance, if θ is the mass of an unknown particle) L b ( x|θ) does not depend on θ and the likelihood ratio λ(θ) is equal, up to a multiplicative factor, to the likelihood L s+b ( x|θ).Hence, the maximum likelihood estimate of θ is θ = θ, and the error on θ can be determined as usual in maximum likelihood estimates from the shape of −2 ln λ(θ) around its minimum, finding its intersection with an horizontal line at −2 ln λ( θ) + 1.
In order to determine the significance of the measured value $\hat{\theta}$, if the conditions to apply Wilks' theorem [2] are valid (see Section 2.2), the value $2\ln\lambda(\theta)$ can be approximated by a chi-squared variable. Hence, the square root of its value at $\theta = \hat{\theta}$, where $-\ln\lambda(\theta)$ has its minimum:

$$Z \simeq \sqrt{2\ln\lambda(\hat{\theta})}$$

gives an approximate estimate of the significance $Z$. In Section 17 the interpretation of significance in the case of parameter estimates from data will be further discussed, and it will be clear that the estimate of significance at a fixed value of a measured parameter may suffer from a systematic overestimate (the so-called look-elsewhere effect).
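The counting and extended forms of $-\ln\lambda$ written above can be evaluated directly. The following is a minimal sketch, not the procedure of any specific analysis: the function names are hypothetical, and the significance is computed under the added assumption that the signal yield $s$ itself is the free parameter, so that the minimum of $-\ln\lambda$ is at $\hat{s} = n - b$.

```python
import math

def neg_log_lambda_counting(n, s, b):
    """-ln(lambda) = s - n*ln(1 + s/b) for a pure counting experiment."""
    return s - n * math.log1p(s / b)

def neg_log_lambda_extended(xs, s, b, f_s, f_b):
    """-ln(lambda) = s - sum_i ln(1 + s*f_s(x_i) / (b*f_b(x_i))).

    xs is the list of measured values, f_s and f_b are callables
    returning the signal and background PDFs (hypothetical inputs).
    """
    return s - sum(math.log1p(s * f_s(x) / (b * f_b(x))) for x in xs)

def significance_at_best_fit(n, b):
    """Approximate Z = sqrt(2 ln lambda) at s_hat = n - b (assumption: s is free)."""
    if n <= b:
        return 0.0  # no excess over the expected background
    s_hat = n - b
    return math.sqrt(-2.0 * neg_log_lambda_counting(n, s_hat, b))
```

If the signal and background PDFs coincide, the event variables carry no discriminating power and the extended form reduces to the counting form, which is a quick consistency check of the two functions.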

EPJ Web of Conferences
In order to quote an upper limit using the frequentist approach, the distribution of the test statistic $\lambda$ (or equivalently $-2\ln\lambda$) under the signal-plus-background ($s+b$) hypothesis has to be known, and the p-value corresponding to the observed value $\lambda = \hat{\lambda}$ has to be determined. The proposed modification to the purely frequentist approach consists of finding two p-values corresponding to the $s+b$ and $b$ hypotheses:

$$CL_{s+b}(\theta) = P_{s+b}\big(\lambda(\theta) \le \hat{\lambda}\big)\,, \qquad CL_{b}(\theta) = P_{b}\big(\lambda(\theta) \le \hat{\lambda}\big)\,.$$

From those two probabilities, the following quantity can be derived:

$$CL_{s}(\theta) = \frac{CL_{s+b}(\theta)}{CL_{b}(\theta)}\,.$$

Upper limits are determined by excluding the range of the parameters of interest (e.g. a particle's mass) for which $CL_s(\theta)$ is lower than one minus the conventional exclusion confidence level (i.e. lower than 5% or 10% for the typical confidence levels of 95% or 90%). For this reason, the modified frequentist approach is often referred to as the $CL_s$ method.
In most cases, the probabilities $P_{s+b}$ and $P_b$ are not trivial to obtain analytically and are determined using numerical Monte Carlo extractions, often referred to as pseudoexperiments, or toy Monte Carlo. In this way, $CL_{s+b}$ and $CL_b$ can be estimated as the fraction of generated pseudoexperiments for which $\lambda(\theta) \le \hat{\lambda}$ in the $s+b$ and $b$ cases respectively. An example of the outcome of this numerical approach is shown in Fig. 14. This method does not produce the desired (usually 90% or 95%) coverage from the frequentist point of view, but it does not suffer from the problematic features of frequentist upper limits that were observed at the end of Section 12. The $CL_s$ method has convenient statistical properties:
• It is conservative from the frequentist point of view. In fact, since $CL_b \le 1$, we have $CL_s(\theta) \ge CL_{s+b}(\theta)$, so it overcovers. This means that a $CL_s$ upper limit is less stringent than a purely frequentist limit.
• Several measurements can be combined by multiplying the likelihood functions of the individual channels to produce a combined likelihood function. This is an advantage from the technical point of view. Moreover, combining a measurement with a second measurement of low sensitivity implies multiplying $\lambda$ by the likelihood ratio of the added channel, which is close to one (the $s+b$ and $b$ hypotheses have similar values of the likelihood functions if the sensitivity to signal is low). Hence, the combined test statistic is not much different from that of the most sensitive measurement, and the corresponding limit will not differ much from the one obtained using the most sensitive channel only.
• If no signal event is observed ($n = 0$), the observed limit does not depend on the expected amount of background.
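The pseudoexperiment procedure described above can be sketched for the simple counting case, where $-2\ln\lambda = 2\,[s - n\ln(1 + s/b)]$. This is an illustrative toy under stated assumptions, not the LEP or LHC implementation: the function names, the number of toys and the seed are hypothetical choices, and the Poisson generator is a basic algorithm suitable only for small expected yields.

```python
import math
import random

def poisson(rng, mean):
    """Draw a Poisson variate (Knuth's multiplication method, small means only)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def neg2_ln_lambda(n, s, b):
    """-2 ln(lambda) for a counting experiment with n observed events."""
    return 2.0 * (s - n * math.log1p(s / b))

def cls_from_toys(n_obs, s, b, n_toys=20000, seed=1):
    """Estimate (CL_s+b, CL_b, CL_s) as fractions of pseudoexperiments."""
    rng = random.Random(seed)
    q_obs = neg2_ln_lambda(n_obs, s, b)
    # lambda <= lambda_obs is equivalent to -2 ln(lambda) >= q_obs
    n_sb = sum(neg2_ln_lambda(poisson(rng, s + b), s, b) >= q_obs
               for _ in range(n_toys))
    n_b = sum(neg2_ln_lambda(poisson(rng, b), s, b) >= q_obs
              for _ in range(n_toys))
    cl_sb, cl_b = n_sb / n_toys, n_b / n_toys
    cl_s = cl_sb / cl_b if cl_b > 0 else float("nan")
    return cl_sb, cl_b, cl_s
```

For $n = 0$ the exact values are $CL_{s+b} = e^{-(s+b)}$ and $CL_b = e^{-b}$, so $CL_s = e^{-s}$, which gives a simple cross-check of the toy estimate within its statistical precision.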
For a simple Poissonian counting experiment with expected signal $s$ and background $b$, using the likelihood ratio of Eq. 52, one can demonstrate that the $CL_s$ approach leads to a result identical to the Bayesian one (Eq. 20). More generally, $CL_s$ upper limits are often numerically very similar to Bayesian upper limits computed with a uniform prior. The meaning of Bayesian limits is, of course, very different.
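For the counting experiment the toy Monte Carlo is not needed: since $\lambda = e^{-s}(1 + s/b)^n$ increases with $n$, the two p-values reduce to cumulative Poisson probabilities, $CL_s = P(n \le n_{obs}\,|\,s+b)\,/\,P(n \le n_{obs}\,|\,b)$. The sketch below solves $CL_s = 1 - CL$ for $s$ by bisection; the function names, the search range `s_max` and the tolerance are hypothetical choices made for illustration.

```python
import math

def poisson_cdf(n, mean):
    """P(m <= n) for a Poisson distribution with the given mean."""
    term = total = math.exp(-mean)
    for m in range(1, n + 1):
        term *= mean / m
        total += term
    return total

def cls_counting(n_obs, s, b):
    """CL_s = CL_s+b / CL_b for a counting experiment."""
    return poisson_cdf(n_obs, s + b) / poisson_cdf(n_obs, b)

def cls_upper_limit(n_obs, b, cl=0.95, s_max=100.0, tol=1e-9):
    """Smallest s excluded at the given confidence level (bisection on s)."""
    alpha = 1.0 - cl
    lo, hi = 0.0, s_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cls_counting(n_obs, mid, b) > alpha:
            lo = mid   # not yet excluded: the limit is above mid
        else:
            hi = mid   # excluded: the limit is at or below mid
    return 0.5 * (lo + hi)
```

With $n_{obs} = 0$ the result is $-\ln(1 - CL)$, about 3.0 at 95% CL, whatever the value of $b$, in agreement with the last property listed above.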
The interpretation of limits obtained using the $CL_s$ method is, however, not obvious, and it matches neither the frequentist nor the Bayesian approach. It has been defined as [15]: "approximation to the confidence in the signal hypothesis, one might have obtained if the experiment had been performed in the complete absence of background."

Incorporate systematic uncertainties (nuisance parameters)
Some of the parameters in the set $\vec{\theta} = (\theta_1, \cdots, \theta_m)$ are not of direct interest for our measurement, but are needed to model unknown characteristics of our data sample. Those parameters are called nuisance parameters. Nuisance parameters may appear when the background yield is estimated with some uncertainty from simulation or from control samples in data, or in the modeling of the distributions of the observed variables in signal and background events, including the effect of detector resolution. The resolution needed to model the experimental width of a new particle's mass peak is an example of a nuisance parameter. If we are only interested in the measurement of a signal strength $\mu$, all parameters $\theta_i$ are nuisance parameters. If we are also interested in the measurement of the mass of a new particle, like the Higgs boson, the parameter corresponding to the particle mass, say $\theta_1$, is, like $\mu$, a parameter of interest (sometimes referred to as POI), and $\theta_2, \cdots, \theta_m$ are nuisance parameters. More generally, let us divide the parameter set into two subsets: the POIs $\vec{\theta} = (\theta_1, \cdots, \theta_h)$ and the nuisance parameters $\vec{\nu} = (\nu_1, \cdots, \nu_l)$, where $m = h + l$.
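The treatment of nuisance parameters is a well defined task under the Bayesian approach. The posterior joint probability distribution for all the unknown parameters can be defined as follows:

$$P(\vec{\theta}, \vec{\nu}\,|\,\vec{x}) = \frac{L(\vec{x};\, \vec{\theta}, \vec{\nu})\,\pi(\vec{\theta}, \vec{\nu})}{\int L(\vec{x};\, \vec{\theta}', \vec{\nu}')\,\pi(\vec{\theta}', \vec{\nu}')\,\mathrm{d}^h\theta'\,\mathrm{d}^l\nu'}\,,$$

where $\pi(\vec{\theta}, \vec{\nu})$ is the prior distribution of the unknown parameters and $L(\vec{x};\, \vec{\theta}, \vec{\nu})$ is the likelihood function. The probability distribution of $\vec{\theta}$ can be obtained as the marginal PDF, integrating the joint PDF over all nuisance parameters:

$$P(\vec{\theta}\,|\,\vec{x}) = \int P(\vec{\theta}, \vec{\nu}\,|\,\vec{x})\,\mathrm{d}^l\nu\,.$$

The problem is well defined, and the only difficulty is the numerical integration in multiple dimensions. Several algorithms can be adopted for this problem; one that performs particularly well in those cases is Markov-chain Monte Carlo [16].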
The treatment of nuisance parameters under the frequentist approach is more difficult to perform rigorously. Cousins and Highland [17] proposed to adopt the same approach used for the Bayesian treatment to determine approximate likelihood functions for the signal-plus-background and background-only hypotheses. This hybrid Bayesian-frequentist approach does not provide an exact frequentist solution, but in most cases it can be shown to be a very close approximation to the exact treatment. The hybrid likelihood functions can be written, integrating Eq. 9, as:

$$L_{s+b}(\vec{x}\,|\,\vec{\theta}) = \int L_{s+b}(\vec{x}\,|\,\vec{\theta}, \vec{\nu})\,\pi(\vec{\nu})\,\mathrm{d}^l\nu\,, \qquad L_{b}(\vec{x}\,|\,\vec{\theta}) = \int L_{b}(\vec{x}\,|\,\vec{\theta}, \vec{\nu})\,\pi(\vec{\nu})\,\mathrm{d}^l\nu\,.$$

In order to include detector resolution effects, for instance to model the width of a signal peak, the hybrid approach requires the convolution of the likelihood function with the experimental resolution function.
The above likelihood functions can be used to compute $CL_s$ limits, as was done in the combined Higgs limit at LEP [14].
In the case of an event counting problem, if the number of background events is known with some uncertainty, the PDF of the background estimate $b'$ can be modeled as a function of the true unknown expected background $b$: $P(b'; b)$. The likelihoods, as a function of the parameter of interest $s$ and the unknown nuisance parameter $b$, can be written as:

$$L_{s+b}(n, b';\, s, b) = \frac{(s+b)^n}{n!}\,e^{-(s+b)}\,P(b'; b)\,, \qquad L_{b}(n, b';\, b) = \frac{b^n}{n!}\,e^{-b}\,P(b'; b)\,.$$

In order to eliminate the dependence on the nuisance parameter $b$, the hybrid likelihoods can be written as:

$$L_{s+b}(n, b';\, s) = \int_0^{\infty} \frac{(s+b)^n}{n!}\,e^{-(s+b)}\,P(b'; b)\,\mathrm{d}b\,, \qquad L_{b}(n, b') = \int_0^{\infty} \frac{b^n}{n!}\,e^{-b}\,P(b'; b)\,\mathrm{d}b\,.$$

In the simplest cases, for instance when $P(b'; b)$ is a Gaussian function, the integration can be performed analytically [18]. In the Gaussian case, however, when the r.m.s. of the distribution is not much smaller than $b'$, $P(b'; b)$ extends to negative values of $b$, and the integration includes unphysical regions. In order to avoid such cases, distributions whose range is limited to positive values are preferred. For instance, a log-normal distribution (the distribution of a random variable whose logarithm is distributed according to a Gaussian) is usually preferred to a plain Gaussian.
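The integration over the nuisance parameter $b$ can be sketched numerically for a log-normal constraint. The code below is an illustration under assumptions that are not in the text: the log-normal is parametrized with median `b_est` and width $\ln(1 + \text{rel\_unc})$, the integral is done with a midpoint rule in $u = \ln b$ over $\pm 8$ standard deviations, and all function names are hypothetical.

```python
import math

def poisson_pmf(n, mean):
    """Poisson probability of n events for a strictly positive mean."""
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))

def hybrid_likelihood(n, s, b_est, rel_unc, n_steps=2000, n_sigma=8.0):
    """Integrate Poisson(n; s + b) over a log-normal PDF for b.

    Assumed parametrization: ln(b) is Gaussian with mean ln(b_est)
    and standard deviation ln(1 + rel_unc). Use s = 0 for the
    background-only hybrid likelihood.
    """
    sigma = math.log1p(rel_unc)
    center = math.log(b_est)
    du = 2.0 * n_sigma * sigma / n_steps
    total = 0.0
    for i in range(n_steps):
        u = center - n_sigma * sigma + (i + 0.5) * du   # u = ln(b)
        z = (u - center) / sigma
        weight = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
        total += poisson_pmf(n, s + math.exp(u)) * weight * du
    return total
```

Because the integration variable is $\ln b$, only positive values of $b$ are ever sampled, which is the reason given above for preferring a log-normal to a Gaussian. For a vanishing uncertainty the result approaches the plain Poisson probability with mean $s + b'$.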

Profile likelihood
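An alternative procedure to the hybrid treatment of nuisance parameters is to introduce the profile likelihood, defined as follows:

$$\lambda(\mu) = \frac{L(\vec{x}\,|\,\mu, \hat{\hat{\vec{\theta}}}(\mu))}{L(\vec{x}\,|\,\hat{\mu}, \hat{\vec{\theta}})}\,,$$

where $\hat{\mu}$ and $\hat{\vec{\theta}}$ are the best fit values for $\mu$ and $\vec{\theta}$ corresponding to the observed data sample, and $\hat{\hat{\vec{\theta}}}(\mu)$ is the best fit value for $\vec{\theta}$ obtained for a fixed value of $\mu$. Above we have assumed that all parameters $\vec{\theta}$ are treated as nuisance parameters and $\mu$ is the only parameter of interest.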
Usually the profile likelihood, as a function of $\mu$, is broader than the original likelihood function evaluated with the nuisance parameters fixed, due to the loss of information introduced by the presence of nuisance parameters.
The profile likelihood cannot be treated as an ordinary likelihood function that depends only on $\mu$, but it has several interesting properties that make it more convenient to use than the hybrid likelihoods, since it requires no numerical integration. In particular, since it is defined as the ratio of two likelihood functions, Wilks' theorem can be applied for sufficiently large samples. In this case, the test statistic $q_\mu = -2\ln\lambda(\mu)$ is asymptotically distributed according to a $\chi^2$ with one degree of freedom (corresponding to the single parameter of interest, which is not profiled), and the significance corresponding to the value of $\mu$ that minimizes $q_\mu$ can be approximated as $Z_\mu \simeq \sqrt{q_\mu}$ [19].

Different variations in the definition of the profile likelihood have been proposed and adopted by various experiments. A review of the main procedures adopted at LEP, Tevatron and LHC can be found in [20]. In particular, for the Higgs search at LHC the adopted test statistic is:

$$\tilde{q}_\mu = -2\ln\frac{L(\vec{x}\,|\,\mu, \hat{\hat{\vec{\theta}}}(\mu))}{L(\vec{x}\,|\,\hat{\mu}, \hat{\vec{\theta}})}\,, \qquad 0 \le \hat{\mu} \le \mu\,.$$

Above, the constraint $\hat{\mu} \ge 0$ protects against unphysical values of the signal strength, while cases with an upward fluctuation of the data, such that $\hat{\mu} > \mu$, are not considered as evidence against the signal hypothesis with signal strength equal to $\mu$: the test statistic is set to zero in those cases. For the definition of the $\tilde{q}_\mu$ test statistic, as well as for the most widely adopted variations of the profile likelihood, asymptotic approximations that extend the results of Wilks' theorem have been computed and are treated extensively in [4]. As an example, the asymptotic approximation for the distribution of $\tilde{q}_\mu$ is:

$$f(\tilde{q}_\mu\,|\,\mu) = \frac{1}{2}\,\delta(\tilde{q}_\mu) + \begin{cases} \dfrac{1}{2}\,\dfrac{1}{\sqrt{2\pi}}\,\dfrac{1}{\sqrt{\tilde{q}_\mu}}\,e^{-\tilde{q}_\mu/2} & 0 < \tilde{q}_\mu \le \mu^2/\sigma^2 \\[2ex] \dfrac{1}{\sqrt{2\pi}\,(2\mu/\sigma)}\,\exp\left[-\dfrac{1}{2}\,\dfrac{(\tilde{q}_\mu + \mu^2/\sigma^2)^2}{(2\mu/\sigma)^2}\right] & \tilde{q}_\mu > \mu^2/\sigma^2 \end{cases}$$

where $\delta(\tilde{q}_\mu)$ is a Dirac delta function, which models the cases in which the test statistic is set to zero, and where $\sigma^2 = \mu^2/q_{\mu,A}$, in which $q_{\mu,A}$ is the value of the test statistic $-2\ln\lambda$ evaluated on the so-called Asimov set [21], i.e. a representative data set in which the yields of all data samples are set to their expected values and the nuisance parameters to their nominal values.

Asimov sets can also be used to compute approximate estimates of the expected experimental sensitivity, which would otherwise require the extraction of a large number of pseudoexperiments. The square roots of the test statistic evaluated at Asimov data sets corresponding to the assumed signal strength $\mu$ can be used to approximate the median significance, assuming a data sample distributed according to the background-only hypothesis: $\mathrm{med}[Z_\mu\,|\,0] = \sqrt{\tilde{q}_{\mu,A}}$. A comprehensive treatment of asymptotic approximations can be found in [4].
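As an illustration of profiling, consider a toy model that is not taken from the text: a signal region with $n \sim \mathrm{Poisson}(\mu s + b)$ and a control region with $m \sim \mathrm{Poisson}(\tau b)$, where $b$ is the only nuisance parameter. For this model the conditional best fit $\hat{\hat{b}}(\mu)$ is the positive root of a quadratic equation, so the test statistic with the constraint $0 \le \hat{\mu} \le \mu$ can be written in closed form. All names below are hypothetical, and the sketch ignores constant terms of the likelihood, which cancel in the ratio.

```python
import math

def _xlogy(x, y):
    """x * ln(y), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def log_likelihood(mu, b, n, m, s, tau):
    """ln L up to constants for n ~ Poisson(mu*s + b), m ~ Poisson(tau*b)."""
    return (_xlogy(n, mu * s + b) - (mu * s + b)
            + _xlogy(m, tau * b) - tau * b)

def profiled_b(mu, n, m, s, tau):
    """Best-fit background for fixed mu: root of (1+tau) b^2 + B b - m mu s = 0."""
    a = 1.0 + tau
    big_b = a * mu * s - n - m
    return (-big_b + math.sqrt(big_b ** 2 + 4.0 * a * m * mu * s)) / (2.0 * a)

def q_tilde(mu, n, m, s, tau):
    """-2 ln of the profile likelihood ratio with 0 <= mu_hat <= mu."""
    mu_hat = (n - m / tau) / s
    if mu_hat > mu:
        return 0.0                     # upward fluctuation: no evidence against mu
    mu_best = max(mu_hat, 0.0)         # protect against unphysical negative signal
    num = log_likelihood(mu, profiled_b(mu, n, m, s, tau), n, m, s, tau)
    den = log_likelihood(mu_best, profiled_b(mu_best, n, m, s, tau), n, m, s, tau)
    return max(0.0, -2.0 * (num - den))

def asimov_median_z(mu, b, s, tau):
    """sqrt(q_tilde) on the background-only Asimov set n = b, m = tau*b."""
    return math.sqrt(q_tilde(mu, b, tau * b, s, tau))
```

The Asimov data set here simply replaces the observed counts with their background-only expectations, which need not be integers; this is why the factorial terms are dropped rather than evaluated.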

The look-elsewhere effect
In several cases experiments look for resonances at unknown mass values. This is the case, for instance, of the Higgs boson search. If an excess of data compared to the background expectation is found at any mass value, it can be interpreted as a possible signal of the new resonance at the observed mass, but the peak could be produced either by the presence of a real signal or by a background fluctuation. The signal significance can be computed from the p-value of the measured test statistic $q$ assuming a fixed value $m_0$ of the resonance mass. This is called local significance, and the corresponding local p-value can be written as:

$$p(m_0) = \int_{q_{obs}(m_0)}^{\infty} f(q\,|\,\mu = 0)\,\mathrm{d}q\,,$$

where $f(q|\mu)$ is the PDF of the adopted test statistic $q$ for a given value of the signal strength $\mu$.
The local p-value gives the probability of a background fluctuation at a fixed value of the mass $m_0$. The probability of a background overfluctuation at any mass value in the range of interest, called the global p-value, is larger than the local p-value, so the local significance overestimates the global significance. The magnitude of the effect is larger as the mass resolution gets worse. In fact, assuming a small intrinsic width of the new particle, a very good mass resolution implies that a peak can appear from a background fluctuation only if the masses of background events are by chance all close to each other within the experimental resolution, which is less likely as the resolution gets smaller.
More generally, when an experiment is looking for a signal where one or more parameters $\vec{\theta}$ are unknown (these could be the mass, the width and other properties of a new signal), in the presence of an excess in data with respect to the background expectation the unknown parameter (or parameters) can be determined from the data sample itself. In those cases, the local p-value computed at fixed values $\vec{\theta}_0$ of the unknown parameter set underestimates the global p-value, which expresses the probability of a background fluctuation for any possible value of the parameters in the range of interest. The global p-value can be computed using as test statistic the largest value of the estimator over the entire parameter range:

$$q(\hat{\vec{\theta}}) = \max_{\vec{\theta}^{\,min} < \vec{\theta} < \vec{\theta}^{\,max}} q(\vec{\theta})\,.$$

The distribution of $q(\hat{\vec{\theta}})$ from Eq. (73) is not easy to determine, and usually an intensive random extraction of pseudoexperiments is needed. In order to determine a significance close to the discovery level of $5\sigma$, p-values of the order of $3 \times 10^{-7}$ need to be evaluated, hence tens of millions of background-only pseudoexperiments need to be generated, and in many cases this brute-force approach is intractable.
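The numbers quoted above can be checked with a few lines of code. This sketch assumes the one-sided Gaussian-tail convention for converting between a p-value and a significance $Z$, and estimates the number of pseudoexperiments from the binomial relative uncertainty $\approx 1/\sqrt{N p}$; the function names and the default target precision are hypothetical.

```python
import math
from statistics import NormalDist

def p_value_from_z(z):
    """One-sided Gaussian tail probability above z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def z_from_p_value(p):
    """Significance Z whose one-sided Gaussian tail probability is p."""
    return -NormalDist().inv_cdf(p)

def toys_needed(p, rel_precision=0.3):
    """Pseudoexperiments needed to estimate p with the given relative precision.

    Uses the binomial approximation: relative error ~ 1 / sqrt(N * p).
    """
    return math.ceil(1.0 / (p * rel_precision ** 2))
```

With these conventions, `p_value_from_z(5.0)` is about $2.9 \times 10^{-7}$, and a relative precision of 30% on such a p-value already requires a few times $10^7$ pseudoexperiments.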
An approximate way to determine a global significance taking into account the look-elsewhere effect is reported in [22], relying on the asymptotic behavior of likelihood ratio estimators. It is possible to demonstrate [23] that the probability that the test statistic $q(\hat{m})$ is larger than a given value $u$ is bounded by:

$$p_{glob} = P(q(\hat{m}) > u) \le \langle N_u \rangle + P(\chi^2 > u)\,,$$

where $P(\chi^2 > u)$ comes from Wilks' asymptotic approximation of the distribution of the local test statistic $q(m)$ as a $\chi^2$ distribution with one degree of freedom, and $\langle N_u \rangle$ is the average number of upcrossings, i.e. the average number of times the curve $q = q(m)$ crosses a horizontal line at a given level $q = u$ with a positive derivative. This is visualized in an example in Fig. 15. The value of $\langle N_u \rangle$ could be very small, depending on the level $u$. Fortunately, a scaling law exists, so, starting from a different level $u_0$, one can extrapolate $\langle N_u \rangle$ from $\langle N_{u_0} \rangle$ as:

$$\langle N_u \rangle = \langle N_{u_0} \rangle\,e^{-(u - u_0)/2}\,.$$

This allows one to evaluate $\langle N_{u_0} \rangle$ by generating a number of pseudoexperiments much smaller than what would be needed to determine $\langle N_u \rangle$ with comparable precision.
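The bound and the scaling law above translate into a short computation. The sketch below counts upcrossings of a sampled curve $q(m)$ at a low level $u_0$, extrapolates to the level $u$ of interest, and adds the $\chi^2$ tail with one degree of freedom. The function names are hypothetical, the curve is assumed to be sampled finely enough that no crossing is missed, and in practice $\langle N_{u_0} \rangle$ would be an average over several background-only pseudoexperiments.

```python
import math

def count_upcrossings(q_values, level):
    """Number of times the sampled curve crosses `level` going upwards."""
    return sum(1 for lo, hi in zip(q_values, q_values[1:]) if lo < level <= hi)

def chi2_tail_1dof(u):
    """P(chi^2 > u) for one degree of freedom."""
    return math.erfc(math.sqrt(u / 2.0))

def global_p_value_bound(u, mean_n_u0, u0):
    """Upper bound on the global p-value: <N_u> + P(chi^2 > u)."""
    mean_n_u = mean_n_u0 * math.exp(-(u - u0) / 2.0)   # scaling law
    return min(1.0, mean_n_u + chi2_tail_1dof(u))
```

For instance, the sampled curve `[0, 2, 0.5, 3, 0.2, 1.5, 0.1]` has three upcrossings at the level $u_0 = 1$, and extrapolating to $u = 25$ gives a bound dominated by the upcrossing term rather than by the local $\chi^2$ tail.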

Figure 1. Probability distribution for a discriminating variable $t(x) = x$ which has two different PDFs for the signal (red) and background (yellow) hypotheses under test. Applying a selection cut, in this case $t \le t_{cut}$, enriches the selected data sample in signal, reducing the fraction of background.

Figure 3. Examples of two-dimensional selections of a signal (blue dots) against a background (red dots).A linear cut is chosen on the left plot, while a box cut is chosen on the right plot.

Figure 4. Upper limits at the 90% CL (left) and 95% CL (right) for a Poissonian process using the Bayesian approach, as a function of the expected background $b$ and for a number of observed events $n$ from $n = 0$ to $n = 10$.
where $P(n_b; b)$ and $P(n_s; s)$ are Poissonian probability distributions. It is easy to demonstrate that $P(n; s, b)$ is again a Poissonian distribution with average $s + b$. Zech proposed to modify the first term, $P(n_b; b)$, to take into account the observation of $n$ events, which limits the possible values of $n_b$ to the range from 0 to $n$. In this way, one would replace $P(n_b; b)$ with $P'(n_b; b) = P(n_b; b) \,/\, \sum_{n'_b = 0}^{n} P(n'_b; b)$.

Figure 7. Graphical illustration of Neyman's belt construction for upper limits determination.

Figure 8. Illustration of the flip-flopping problem. The plot shows the quoted central value of $\mu$ as a function of the measured $x$ (dashed line), and the 90% confidence interval corresponding to a choice to quote a central interval for $x/\sigma \ge 3$ and an upper limit for $x/\sigma < 3$.

the upper limit $s^{up}$ such that: $s^{up} = \min\{\, s : \sum_{m=0}^{n} P(m; s) < \alpha \,\}$.

Figure 12. 90% confidence belt for a Poissonian process using Feldman-Cousins ordering, in the case of b = 3.

Table 4.

Figure 13. Upper limits at the 90% confidence level for a Poissonian process using Feldman-Cousins ordering, as a function of the expected background $b$ and for a number of observed events $n$ from 0 to 10.

Figure 14. Example of determination of $CL_s$ from pseudoexperiments. The distribution of the test statistic $-2\ln\lambda$ is shown in blue for the signal-plus-background hypothesis and in red for the background-only hypothesis. The black line shows the value of the test statistic measured in data, and the hatched areas represent $CL_{s+b}$ (blue) and $1 - CL_b$ (red).

data sets corresponding to the assumed signal strength $\mu$ can be used to approximate the median significance, assuming a data sample distributed according to the background-only hypothesis: $\mathrm{med}[Z_\mu\,|\,0] = \sqrt{\tilde{q}_{\mu,A}}$.

Figure 15. Visual illustration of upcrossings, computed to determine $\langle N_{u_0} \rangle$. In this example, we have $N_u = 3$.

Table 1. Significances expressed as "$Z\sigma$" and corresponding p-values in a number of typical cases.

Table 2. Upper limits in the presence of negligible background.

Table 3. Jeffreys priors for a number of typical PDFs.