Nonparametric Tests for Purity of Low Statistics Data

Nonparametric methods are most suitable for tasks in which the model is uncertain or complex and the statistics of the analyzed data are small. They are indispensable when a high tempo of analysis is required.


Introduction
Nonparametric methods for the analysis of experimental data are methods that do not require a detailed parametric description of the data and, more importantly, require no a priori information about the values of the parameters. A general summary of the benefits of nonparametric methods can be found in [1-3]. These methods are a kind of "short-cut statistics" [4]: they are easier to implement, do not require much input information, work faster, and are more reliable, robust, and insensitive to the size of the data statistics.

Ratio median/mean as test for purity
Let a distribution A = {t_i}, i = 1, 2, ..., m (e.g., data of radioactive decay) be given. We regard purity as the uniqueness of the distribution function for the given sample: we want to check whether the sample follows, e.g., a single exponential distribution F(t) = 1 − exp(−t/T), t ∈ [0, ∞), with unknown parameter T, or a mixture of other distributions, possibly of the same type but with different values of T. We are not interested in the value of T itself, but if A is "pure", obtaining an estimate of T is a trivial statistical problem.
Thus, we set up a test: we build a function c(t_i) with the following properties:
• its distribution density is significantly different from zero in some region R of the variables t_i; the best case is when the mathematical expectation of c(t_i) is close to a point where this function has a bell shape;
• its density in the case of a single distribution differs significantly from the density in the case of a mixture of distributions precisely in the region R.
Here, the function M_m, defined as the ratio of the sample median m_d to the sample mean m_n, is taken as such a test. Below we look at its operation in two important cases: that of the exponential density and that of the normal one, where T and a denote the respective expectations and σ is the standard deviation (the square root of the variance).
• The sample mean m_n = (1/m) Σ_{i=1}^{m} t_i is an unbiased and efficient estimate of the parameter T (exponential case) and a (normal case), provided A is "pure"; its unbiased variance estimate is s² = Σ_{i=1}^{m} (t_i − m_n)² / (m − 1).
• The sample median m_d is defined by the following algorithm: arrange all the t_i in increasing order and let j denote the integer part of m/2; then m_d = t_{j+1} if m is odd, and m_d = (t_j + t_{j+1})/2 if m is even. The sample median is asymptotically unbiased and efficient as m grows.
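As an illustration, the M_m statistic can be computed directly from a sample. The sketch below is not part of the paper's own software; the parameter values T = 2 and T = 10 and the 50/50 mixture weights are chosen purely for illustration, to contrast a pure exponential sample with a mixture:

```python
import random
import statistics

def median_mean_ratio(sample):
    """M_m test statistic: sample median divided by sample mean."""
    return statistics.median(sample) / statistics.mean(sample)

random.seed(1)

# Pure exponential sample with (unknown to the test) parameter T = 2.0.
pure = [random.expovariate(1 / 2.0) for _ in range(30)]

# Mixture of two exponentials with different T values (illustrative weights).
mixed = [random.expovariate(1 / 2.0) if random.random() < 0.5
         else random.expovariate(1 / 10.0) for _ in range(30)]

print(median_mean_ratio(pure))   # on average close to ln 2 ≈ 0.693
print(median_mean_ratio(mixed))  # typically smaller for such a mixture
```

For the exponential distribution the population median is T·ln 2 while the mean is T, so for a pure sample M_m concentrates near ln 2; a mixture with a long-tailed component pulls the mean up more than the median and shifts M_m away from that value.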
Theoretically, our test statistic M_m has a distribution similar to the gamma distribution, becoming more symmetric as m → ∞ (in the exponential case), and similar to the Cauchy distribution (in the case of the normal distribution).

The distributions of the M m test
However, for data distributions of finite size m, the distribution functions of medians have no compact closed form, so we must restrict ourselves to Monte-Carlo simulation of the M_m distribution for different m.
In Figs. 1 and 2 the sample distributions of M_m are plotted on the basis of 10^6 event samples for m = 30. In [5] this m is called a good representative of low data statistics.
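The Monte-Carlo construction of the M_m distribution can be sketched as follows. This is a minimal illustration using 2·10^4 simulated samples rather than the 10^6 used in the paper, to keep the run time short; note that in the exponential case M_m is scale-free, so the value of T does not matter:

```python
import random
import statistics

def simulate_Mm(m, n_samples, draw):
    """Monte-Carlo sample of the M_m = median/mean statistic."""
    vals = []
    for _ in range(n_samples):
        s = [draw() for _ in range(m)]
        vals.append(statistics.median(s) / statistics.mean(s))
    return vals

random.seed(2)
# Exponential case with T = 1 (the ratio is invariant under rescaling).
expo_Mm = simulate_Mm(30, 20000, lambda: random.expovariate(1.0))
print(statistics.mean(expo_Mm))  # concentrates near ln 2 ≈ 0.693
```

A histogram of `expo_Mm` reproduces the kind of sample distribution shown in the paper's figures for m = 30.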

Confidence intervals for the M m test
A test on exact data would point to the expectation E of M_m. For random data, however, M_m is itself random and scattered around E over a certain interval [E − δ_1, E + δ_2], the confidence interval (CI). Such an interval reflects the degree of our "confidence" in the ability of the test not to reject the right hypothesis and not to accept a false one.
As such a CI we should take the smallest possible interval [E − Δ_1, E + Δ_2] covered by the maximum probability integral. Unfortunately, in the general case this variational problem has no solution, and we must make use of an empirical compromise between the two requirements. In world practice it is customary to take the 68% probability integral and the corresponding interval as the CI. Smaller values of this integral enhance the chance of committing an error of the first kind, greater values that of an error of the second kind.

Given a data sample A, two types of hypothesis tests may be formulated:
• test whether the data does not contradict the hypothesis of its purity;
• test whether the data confirms the hypothesis H_0 of purity against some opposite hypothesis H_A.
In other words, we want to answer the question: is this data a sample from a single probability density f(t), or (in a simple approach) from a weighted sum of several densities f_j(t), Σ_{j=1}^{n} a_j · f_j(t), j = 1, ..., n, with Σ_{j=1}^{n} a_j = 1?
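An empirical shortest interval covering 68% of a simulated M_m sample can be found directly from the sorted simulation output. The sketch below is an illustration, not the paper's tabulation procedure: it uses 2·10^4 simulated samples and a simple sorted-scan search for the exponential case with m = 30:

```python
import random
import statistics

def shortest_interval(values, prob=0.68):
    """Shortest interval covering the given probability mass of a sample."""
    v = sorted(values)
    k = int(prob * len(v))  # number of gaps the interval must span
    best = min(range(len(v) - k), key=lambda i: v[i + k] - v[i])
    return v[best], v[best + k]

random.seed(3)
sims = []
for _ in range(20000):
    s = [random.expovariate(1.0) for _ in range(30)]
    sims.append(statistics.median(s) / statistics.mean(s))

lo, hi = shortest_interval(sims)
print((lo, hi))  # empirical 68% CI for M_m, exponential case, m = 30
```

The sorted-scan works because, for a fixed coverage count, the shortest covering interval must have both endpoints on sample points, so it suffices to slide a window of fixed size over the order statistics.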
If the values t_i are produced by only one term of interest (say, the k-th density), the data is "pure" and only the weight a_k is nonzero; otherwise the sample is a mixture. Let [c_11, c_12] be the confidence interval (of the exponential or normal distribution) for m, the size of the data A. Then the first problem is solved by calculating M_m and checking whether it falls inside [c_11, c_12]; when it does, the data may be regarded as pure at the corresponding significance level.
For the second case we need two confidence intervals: [c_11, c_12] for H_0 and [c_21, c_22] for H_A. Suppose that on the coordinate axis they are arranged as shown in Fig. 3, i.e. the intervals overlap, with c_11 ≤ c_21 ≤ c_12 ≤ c_22. Then:
• If M_m falls inside the range [c_11, c_21], it means that the data does not contradict H_0 (the data is pure) but contradicts H_A (the data is a mixture); thus, it confirms H_0.
• If M_m falls inside the range [c_12, c_22], it means that the data does not contradict H_A (the data is a mixture) but contradicts H_0 (the data is pure); thus, it confirms H_A.
• If M_m falls inside the range [c_21, c_12], it means that the data contradicts neither H_0 nor H_A. This is a failure of the test: one needs data with larger statistics to shrink both confidence intervals and, thus, the size of their overlap.
• The case M_m < c_11 or M_m > c_22 also means a test failure, and one cannot draw conclusions for a data sample with the given statistics.
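The four decision rules above can be collected into a single function. The sketch below is a hypothetical helper, assuming the overlapping ordering c_11 ≤ c_21 ≤ c_12 ≤ c_22; the interval endpoints in the example calls are illustrative numbers, not tabulated values from the paper:

```python
def purity_decision(Mm, c11, c12, c21, c22):
    """Classify M_m against the H0 interval [c11, c12] and the H_A interval
    [c21, c22], assuming they overlap as c11 <= c21 <= c12 <= c22."""
    if Mm < c11 or Mm > c22:
        return "failure: M_m lies outside both intervals"
    if c21 <= Mm <= c12:
        return "failure: M_m lies in the overlap, larger statistics needed"
    if Mm < c21:  # inside [c11, c21): consistent with H0 only
        return "confirms H0: the data is pure"
    return "confirms HA: the data is a mixture"

# Illustrative (hypothetical) interval endpoints:
print(purity_decision(0.65, 0.60, 0.80, 0.75, 0.95))  # confirms H0
print(purity_decision(0.90, 0.60, 0.80, 0.75, 0.95))  # confirms HA
print(purity_decision(0.77, 0.60, 0.80, 0.75, 0.95))  # overlap -> failure
```

In practice the endpoints would be taken from the simulated 68% confidence intervals for the given sample size m.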

Conclusions
It has been shown that the proposed nonparametric "median/mean" criterion is suitable for testing a rather large class of data distributions for purity. This is especially important when the data statistics are small and a priori information about the parameters of the distribution function is lacking.