Statistical significance estimation of a signal within the GooFit framework on GPUs

In order to test the computing capabilities of GPUs with respect to traditional CPU cores a high-statistics toy Monte Carlo technique has been implemented both in ROOT/RooFit and GooFit frameworks with the purpose to estimate the statistical significance of the structure observed by CMS close to the kinematical boundary of the J/ψφ invariant mass in the three-body decay B → J/ψφK. GooFit is a data analysis open tool under development that interfaces ROOT/RooFit to CUDA platform on nVidia GPU. The optimized GooFit application running on GPUs hosted by servers in the Bari Tier2 provides striking speed-up performances with respect to the RooFit application parallelised on multiple CPUs by means of PROOF-Lite tool. The considerable resulting speed-up, evident when comparing concurrent GooFit processes allowed by CUDA Multi Process Service and a RooFit/PROOF-Lite process with multiple CPU workers, is presented and discussed in detail. By means of GooFit it has also been possible to explore the behaviour of a likelihood ratio test statistic in different situations in which the Wilks Theorem may or may not apply because its regularity conditions are not satisfied.


Introduction to GooFit
The word 'GPU-accelerated computing' refers to an enhancement of application performances that can be obtained by offloading compute-intensive portions to the GPU, while the remaining code still runs on the CPUs.The computing capabilities are enhanced once a sequence of elementary arithmetic operations are performed in parallel on a huge amount of data.In the context of High Energy Physics (HEP) analysis application, GooFit [1] is an under development open source data analysis tool, used in applications for parameter estimation, that interfaces ROOT [2]/RooFit [3] to the CUDA [4] parallel computing platform on nVidia's GPUs (it also supports OpenMP).GooFit acts as an interface between the MINUIT [5] minimization algorithm and a parallel processor which allows a Probability Density Function (PDF) to be evaluated in parallel.Fit parameters are estimated at each negativelog-likelihood (NLL) minimization step on the host side (CPU) while the PDF/NLL is evaluated on the device side (GPU) [6].Description and details about GooFit can be found elsewhere [1].In this study a comparison between RooFit and GooFit performances is presented when a huge amount of pseudo-experiments need to be fitted several times with the aim to estimate a p-value and thus the statistical significance of a new signal reconstructed from the data.The used hardware setup consists in two servers, one equipped with two nVidia TeslaK20 and 32 cores (16 + 16 by Hyper-Threading) and the other with one nVidia TeslaK40 and 40 (20 + 20) cores [7].
Figure 1.Fits to the background-subtracted J/ψφ invariant mass in the B + → J/ψφK + decay [8]; the significance of the left peak is estimated in this study.

Pseudo-experiments for p-value estimation and GooFit performances
In order to test the computing capabilities of GPUs with respect to CPU cores a high statistic toy MC technique has been implemented both in GooFit and RooFit frameworks to estimate the local statistical significance of the structure observed by CMS close to the kinematical boundary of the J/ψφ invariant mass in the three-body decay B + → J/ψφK + [8].The found fit parameters (Fig. 1) are compatible with the state Y(4140) observed for the first time by CDF [9].
MC toys are used to estimate the probability (p-value) that background fluctuations would -alone -give rise to a signal as much significant as that seen in the data.A single toy fit cycle consists in the following sequence of steps:  2. a Binned Maximum Likelihood fit (BML) is done with phase-space model (null hypothesis); 3. 8 BML fits are performed by adding to the phase-space a Voigtian model truncated to account for the kinematical threshold (alternative hypothesis); the Gaussian resolution function has fixed width (2MeV) and the signal yield is constrained to be positive.For each bin the PDF value is estimated by integration over the bin since the signal is steep with respect to the bin size.The 8 fits differ by the starting values (2 masses and 4 widths) within the region of interest defined from CDF values [9] (no need to take the Look-Elsewhere-Effect (LEE [10]) into account).
4. For each fit a ∆χ 2 value is calculated with respect to the null hypothesis fit and the best value is chosen among the 8 alternative fits.The ∆χ 2 (the test statistic) distribution is obtained over the whole sample of MC toys.
The speed-up for a GooFit single process with respect to a RooFit one is 64(48) for a TK40(20).The GooFit process does not exploit the whole computational power of a GPU ( 70%); the Multi Process Server (MPS) allows the execution of up to 16 simultaneous processes on the same GPU acting as a scheduler and allowing a balanced full use of GPU capabilities (each process uses one shared GPU and one exclusively assigned CPU).On the other hand PROOF-lite, a dedicated version of PROOF optimized for single multi-core machines [11], is used to efficiently run RooFit toys in parallel on the 72 CPUs available on the two servers.The speed-up of MPS (PROOF-Lite) with respect to one GooFit (RooFit) process as a function of the number of independent processes (workers) follows Amdhal's Law (Fig. 2); in both cases the parallelizable fraction is 0.97 i.e. the serial overhead is 3% of the application execution.
A first performance comparison can be obtained on the server hosting the two TeslaK20; using 2 GooFit/MPS jobs on the 2 GPUs and 30 CPUs (each running 15 processes) against one PROOF-Lite job using 30 CPUs, the speed-up is 45 and pretty stable, as expected, with respect to a strongly varying number of MC toys.
A second comparison can be carried out on both the servers hosting both types of GPUs as a function of the number of produced toys, but limited to 16 independent processes because of the MPS limit for the single TK40.Specifically the comparison is between one RooFit/PROOF-Lite job using 16 workers (on 16 CPU cores) and one GooFit/MPS job running 16 simultaneous processes on a single TK40 or TK20.When using a TK40(TK20) the speed-up has been measured to be 60 (40); the speed-up representing the gain within the same micro-architecture (TK40 vs TK20) is 1.5.All speed-up values are stable with the number of toys.
A third performance comparison can be done from the end-user's point of view.Assuming the analyst has at his own disposal the 2 servers used in these studies equipped with 3 GPUs and 72 CPU cores, it has been measured that 1M of MC toys can be produced in ∼ 11 days with RooFit/PROOF-Lite and ∼ 6 hours with GooFit/MPS.To get a signal statistical significance > 5σ a p-value < 3•10 − 7 is needed, namely a number of MC toys between 1M and 10M is needed.
The final ∆χ 2 distribution, f (∆χ 2 ) is shown in Fig. 3; the MC toys production was stopped once a fluctuation with ∆χ 2 > ∆χ 2 obs was found (Fig. 4).The p-value is estimated by: This corresponds to the statistical significance Zσ = Φ −1 (1 − P)σ 5.52σ, through the inverse function of the cumulative distribution of the standard Gaussian, that is compatible with the lower limit of 5σ quoted in [8] on the basis of 50.5M of MC toys obtained by means of RooFit.

Exploring the applicability limits of Wilks theorem
By means of GooFit it has also been easier to explore the (asymptotic) behaviour of a likelihood ratio test statistic in different situations in which the Wilks theorem may apply or does not apply because its regularity conditions are not satisfied.The Wilks theorem [12] is often used to estimate the p-value associated to a new signal.Given two hypothesis, the null one, H 0 , with ν 0 degrees of freedom (dof) and an alternative one, H 1 , with ν 1 dof, any test statistic t, defined as a likelihood ratio −2lnλ = −2ln(L H 0 /L H 1 ), or similarly (in the asymptotic limit) as a ∆χ 2 = χ 2 H 0 − χ 2 H 1 , approaches a χ 2 distribution with ν = ν 1 − ν 0 dof, provided that the following regularity conditions hold:  1. H 0 and H 1 are nested (H 1 includes H 0 ), 2. while H 1 → H 0 , the H 1 parameters are well behaving (well defined and not approaching some limit), 3. asymptotic limit (namely in the enough large data sample regime).
Once this theorem can be applied, the p-value associated to the signal is p = ∞ t obs χ 2 ν 1 −ν 0 (t)dt and the use of pseudo-experiments to estimate the p-value is not needed (even if still suggested).
When null hypothesis is background-only and the alternative is background plus signal, often the above conditions are not all satisfied, and the MC toys are mandatory.Indeed this is the case previously studied.The signal parameters in the model of H 1 hypothesis are: mass (m), width (Γ) and yield (µ ≥ 0); when H 1 → H 0 the problem is that not only m and Γ are not well defined but also µ tends to the null limit.This explains the previous use of a MC toys technique.
In general the distributions of a test statistic are not predictable and thus need to be extracted from pseudo-experiments.MC toys according to the previously discussed procedure and physics case have been generated for each of the following 4 cases: 1. m and Γ fixed, µ free; 2. m and Γ fixed, µ free but constrained to be positive; 3. m and Γ free, µ free; 4. m and Γ free, µ free but constrained to be positive.
The ∆χ 2 distributions for the four cases are shown in Fig. 5.The fourth case was the one studied so far (with much higher statistics).The first two cases have also been studied and are going to be discussed hereafter.Let us consider a likelihood ratio test statistic t µ = −2lnλ(µ), where µ is the strength parameter, as the basis of the statistical test.This can be a test of µ = 0 for purposes of establishing the existence of a signal process.In this first case, following [13], the PDF of the test statistic f (t µ |µ) = 1 √ 2π 1 √ t µ e −t µ /2 asymptotically approaches a χ 2 ν=1 distribution; this is in agreement with the Wilks theorem, with the difference of dof being one and represented by µ.
A fit to the test statistic distribution with a χ 2 ν model has been performed, where the likelihood ratio distribution was obtained by the already discussed fit procedure but when fixing the values of mass and width parameters to the CMS estimates previously obtained (alternatively they could have been fixed to CDF values), while leaving µ free.The best estimate obtained for the number of dof is ν 1.014 ± 0.001, thus close to the approximate theoretical prediction; the goodness of fit is checked using a chi-square test that returns a 11.8% probability.

Second case: m and Γ fixed, µ free but constrained to be positive
Let us consider the special case of the test statistic t µ with the purpose to test µ = 0 in a model where we assume µ > 0; rejecting the null hypothesis (µ = 0) leads to the discovery of a new signal.In this case, following [13], the test statistic is q 0 = −2lnλ(0) if the estimated signal strength μ ≥ 0 while is null otherwise, with λ(0) being the profile likelihood ratio for µ = 0.The authors of [13] derive analytically that an asymptotic approximation for the PDF of the statistic q 0 under assumption of the background-only (µ = 0) hypothesis is an equal mixture of a delta function at 0 and a chi-square distribution for one dof: A fit to the test statistic distribution with a model consisting in a linear combination of a χ 2 ν function and a narrow step function at zero has been performed (Fig. 6), where the likelihood ratio distribution was obtained by the already discussed fit procedure but when fixing the values of mass and width parameters to the CMS estimates previously obtained, while leaving µ free.The best estimates obtained for the number of dof and the coefficient in front of the step function are ν 0.992 ± 0.001 and ĉ 0.507 ± 0.001, namely close to the approximate theoretical prediction.A chi-square test returns a 3.5% probability for this fit.

Future developments
The GooFit MC toys application run by means of the MPS provides a striking speed-up with respect to the RooFit application parallelized on multiple CPUs by means of PROOF-Lite.The method can be extended to situations with an unexpected signal and a global significance must be estimated: the LEE [10] must be considered and a scanning technique must be implemented in order to consider relevant peaking behaviour with respect to the background model everywhere in the mass spectrum within the same fluctuation.This would increase the execution time of a cycle of fits performed on a single fluctuation.Different scan configurations should be tried to evaluate the systematic uncertainty associated to the scan.The RooFit-based approach would be unbearable for the long required processing time and the use of GooFit would be mandatory.

1 .
generation of fluctuated background binned distribution according to the 3-body phase-space model (the number of entries are fixed to that in the data thus ignoring Poisson fluctuations);

Figure 2 .
Figure 2. Amdhal's fit to speed-up perfomances of GooFit/MPS for a GPU TK40 and up to 16 CPUs.

Figure 3 .
Figure 3. Final ∆χ 2 distribution with one toy (shown in Fig. 4) exceeding the value observed in the data ( 53.0) marked by a red tick.

Figure 4 .
Figure 4. Bkg-only and bkg+signal fits to the MC toy characterized by a ∆χ 2 56.9.

Figure 5 . 1
Figure 5. Different test statistic (∆χ 2 ) distributions for the 4 cases discussed in the text, with the same number (2M) of MC toys.

Figure 6 .
Figure 6.Fit to the ∆NLL distribution of case (2).Fit model has two components: a very narrow step function and a χ 2 ν .