EPJ Web of Conferences 55, 03002 (2013)

Unfolding is the ensemble of techniques aimed at resolving inverse, ill-posed problems. A pedagogical introduction to the origin and main problems related to unfold- ing is presented and used as the the stepping stone towards the illustration of some of the most common techniques that are currently used in particle physics experiments.


Introduction
The problem of recovering the "true", untarnished distribution of the values for a given variable from "smeared", biased and inefficient observations is common to a variety of fields.Figure 1 shows an example from medical imaging [1].Positron Emission Topography (PET) aims at visualizing the blood flow and metabolic activity in an organ by introducing a positron-emitting radioactive material (tracer) and by detecting X-ray photons from electron-positron annihilation (e + e − → γγ) when positrons emitted by the tracer annihilate with electrons from the surrounding organic matter.The reconstruction the photon emission spatial density from the detected counts provides the organ's image and it requires inverting the process outlined in Figure 1.
Images are often blurred by detector effects and corrupted by the presence of additional random noise [2].Inverting the process shown in figure 2 is necessary to recover the details of initial image.
A variety of particle physics measurements share the same "imaging" goal.An illustrative example is the invariant mass of the pair of top-antitop quarks produced in proton-proton (pp) collisions at a center-of-mass energy ( √ s) of 7 TeV at the Large Hadron Collider (LHC) [3] (pp → t t + X) .This is reconstructed by measuring a complex final state involving jets of hadrons and leptons with a multi-purpose detector (in this example the ATLAS detector [ 4]).The inversion of the measuring process, shown in the cartoon of figure 3, is required to correct for detector (and sometimes acceptance) effects and recover the distribution of the underlying physical property.In particle physics unfolding is the ensemble of statistical techniques used to solve what is defined as the inverse problem: infer an unknown distribution f (y) for a variable y from the measured distribution g(s) by using knowledge and/or assumptions on the probability distribution that links the observation to the "true" value.Other terms are used that give somewhat more emphasis to recovering one specific feature of the degraded information so the techniques are also named unsmearing or deconvolving.The proper mathematical formulation of the inverse problem is the crucial step to understand the challenges involved in the unfolding procedure.Scheme for the evolution of the medical imaging process using figures from Ref. [1].The simulated photon pair emission density representing the brain (left, Figure 2) is passed to a simulation of the Positron Emission Tomography (center, Figure 1a) that produces the "observed" count distribution from the photon detector (right, Figure 3a).The names of the figures are as they appear in the reference.
Figure 2. Scheme for the evolution of blurring and degradation of a two dimensional image using figures from Ref. [2].The "true" simulated two-dimensional image (left, Figure 4a) is degraded by convoluting it with a Gaussian "spread" function with the addition of random Gaussian noise (see Section 4 in Ref. [2]) to produce the "observed" image (right, Figure 5A).The names of the figures are as they appear in the reference.

Unfolding foundations
The mathematical foundations of unfolding are intimately related to the description of the inverse problem [10] provided by the Fredholm integral equation of the first type g(s) = Ω K(s, y) f (y)dy (1) where the true f (y) distribution of the variable y = (y 1 ,..,y J ) is related to the measured or observed distribution g(s) of the variable s = (s 1 ,..,s L ) by the convolution with the kernel function K(s, y) [ 11].
In general the variables y and s belong to multidimensional spaces with different dimensions so the two integers J and L are different, in principle.The "volume" Ω represents the support of f (y) i.e. the subspace of the multidimensional space where y is defined.The distribution f (y) is transformed into the reconstructed distribution g(s) generally because of limitations in the reconstruction of the data (biases), non-unitary and non-uniform efficiency in their collection and resolution effects.Given the random nature of both the values of the variables to be observed and of the effects that limit their observation, retrieving f (y) is a statistical estimation problem and the estimator needs to  (10, right) in Ref. [5] (shown with the inclusion of theoretical uncertainties) for events featuring top-antitop quarks produced in √ s = 7 TeV pp collisions at the LHC is reconstructed (right, Figure 1(a) from Ref. [6] ) after the top quark decay products are measured by the ATLAS detector (middle, image from Ref. [7] ).A diagram from Ref. [8] shows the final state partons from the top quark decay at leading order.The names of the figures are those by which they appear in the references.
be assessed for consistency, bias, efficiency, and robustness [ 11,12].If f (y, a) exists such that it is a prediction for f (y) expressed as a function of a set of parameters (a 1 ,..,a P ), the convolution of f (y, a) with K(s, y) can be compared with the observed distribution g(s) to extract the parameter vector a from the data (for instance by means of a fit) and provide a complete description of f (y).However if no parametrized prediction for f (y) exists (as it is often the case), different techniques need to be used to estimate f (y) from g(s).Operatively the measurements that sample g(s) are limited in number and affected by biases, inefficiency and imperfect resolution, so a discretized version of the integral equation 1 is used and a limited number of ingredients define the unfolding problem [ 13].
In the very common one dimensional case where both y and s are real variables, the measured distribution is approximated by the histogram representing the values ν i , the expected number of counts in a given interval of s according to the definition where the interval of definition for s is divided in N sub-intervals by a set of (s 1 ,...,s N ) values and any integral of g(s) over a specified sub-interval provides the total number of observed events in that sub-interval.In a similar manner the true distribution is approximated by a histogram.The range of the allowed values for y is also divided in M sub-intervals by a set of (y 1 ,...,y M ) values and the expected number of counts in one of the sub-intervals is defined as The integral kernel K(s, y) from Equation 1 is approximated by a response matrix R(i, j) representing the probability that an event with a value of the y variable in bin j is observed as an event with 03002-p.3 a value of s in bin i.So Equation 1 is transformed in where ν i and μ j are the expected number of reconstructed and "true" events in bins i and j respectively.Consequently the first ingredient for the unfolding problem described by Equation 4is the knowledge of the response matrix R. In general R is a rectangular matrix and by combining Equation 1with Equation 2, it is connected to the kernel by the equation If the analytical formulation of the kernel is available, R can be determined directly from Equation 5. However most frequently R is obtained by running detailed simulation of the measuring apparatus including as many effects as possible.Monte Carlo events are generated with the best available prediction for the true distribution f (y) and fully simulated with the most accurate model of the detector to produce our best guess of g(s), the distribution of measured values.For some cases it is possible to measure the response to δ-like (unit-impulse) inputs that can allow to determine the kernel in a certain range of values, like the response of a calorimeter to a beam of particle of known energy and nature.This is equivalent to the integral K(s, y 0 ) = b a K(s, y)δ(y − y 0 )dy.The second ingredient is the the vector of expected bin contents νν ν.The vector ν ν ν is approximated by the vector n = (n 1 ,...,n N ) representing the number of observed events in each histogram bin for the variable s.By definition ν ν ν is such that E[n i ]= ν i where E[n i ] indicates the expectation value of n i .
Two additional ingredients are necessary to make the model built in 4 closer to reality.First some interesting events are not observed due to inefficiencies in the detection or to the requirements imposed on the events properties.Such inefficiency is included in the estimate of the response matrix R(i, j) with a proper normalization by defining R i, j = M j=1 P(observed in bin i|true value in bin j) = P(observed anywhere|true value in bin j) = j (6) where the vector = ( 1 ,.., M ) describes the detection efficiency as a function of the histogram bin.
Secondly some of the observed events are not interesting for the measurement one wants to perform as they are due to backgrounds (events that look like the ones of interest, but have different origin) and they modify the observed distribution.Such events have their own distribution b(s) in terms of the values of the observed variable s.The vector β β β of the expected number of background events in each bin of the histogram of s can be defined as Examples of histograms [13] featuring the vectors μ μ μ, and the corresponding vectors n and ν ν ν are shown in figure 4.
In general the model described in Equation 1 is then extended to and its discretized one-dimensional form described in Equation 4 is consequently extended [13] to whose vectorial compact form is

The maximum likelihood solution
Given the problem described by Equation 10, the formal solution is written as where R −1 is the inverse of R.This estimate for μ μ μ can also be derived from the principle of maximum likelihood (ML) [14].If one assumes (fairly generally) that events are being counted in each histogram bin and that the data are consequently independent Poisson observation distributed according to the logarithm of the global likelihood L = N i=1 P(n i |ν i ) resulting from the Poisson assumption is where ν ν ν = ν ν ν(μ) μ) μ) because of equation 10.Consequently the maximum likelihood estimator for ν ν ν obtained by imposing ∂logL(μ i )/∂μ i = 0 ∀ i is given by and consequently the estimate of μ μ μ is obtained as Is this solution always working ?An example shown in Ref. [ 13] reports a double-peaked true distribution for which the resulting ML estimate, derived according to equation 15, shows a multipeaked shape with extremely large variances and very large anticorrelation between neighbouring bins: the estimate turns out to be very different from known input.The response matrix R for this example has sizeable non-diagonal elements and the bin size of the histogram to be "inverted" is smaller than the detector resolution encoded in the model for event migrations.Figure 5 shows the generated "true" histogram μ μ μ, the observed histogram (dashed) and the corresponding expectation values (solid) and the estimator μ est μ est μ est .What is happening?Insight into the reasons for the ML result can be obtained by considering an instance where the true μ μ μ have a fine structure and the detection effects, represented by the response matrix R, dilute the true information while allowing residual structure to be present [ 13].This is shown in figure 6.The application of R −1 aims at restoring the original histogram, according to Equation 15.If the migrations are properly modelled, the inversion returns the correct values if the input data are the expectation vector ν ν ν of the reconstructed bin contents.However the matrix inversion is applied to one instance of the vector n, it is not applied to its expectation value ν.As a consequence, in a suggestively descriptive way, R "assumes" that the fluctuations in n are the residual of a real original structure diluted by the detection effects (and not of statistical origin) and uses the given input and the available model for migrations to reconstruct μ i.e. it magnifies the fluctuations back into the result.
Independently of the large fluctuations induced by the application of the matrix inversion the maximum likelihood solution is an unbiased estimator of μ μ μ because with a covariance given by where δ k,l is the Dirac δ symbol 1 and the equality cov[n k , n k ] = ν uses the property of Poisson distributed data according to equation 12.
If this equation is inverted2 the minimum variance equals the ML variance obtained in Equation 17 i.e.U i, j = U min,i, j .Consequently the ML solution provides the unbiased estimator with the smallest variance.As a consequence estimators providing an additional reduction in variance with respect to the ML estimator will necessarily introduce a bias in the estimate of the true distribution.The balance between bias and variance is a crucial item in the unfolding procedure.Understanding the origin of the large fluctuations in the ML estimator allows to develop techniques to reduce the fluctuations (and consequently the variance of the estimator) while understanding the limitations in terms of bias of the estimator.

Correction factors: a "diagonal" ML
A simple step towards a small variance estimator consists in a simplification of Equation 15 derived by taking the same binning for μ μ μ and ν ν ν and assuming R to be diagonal (no migrations of events between bins when transforming the true distribution into the measured one).The resulting estimate for μ μ μ is where C i are correction factors (often called "bin-by-bin corrections") and they are usually derived from full simulation of the process under investigation.This provides an estimate for the expected number of reconstructed events μ MC i and true events ν MC i and the correction factors are simply derived as The corresponding covariance matrix is estimated [ 13] to be The correction factor C i is often of order unity so the variance of the estimators is not much larger than the Poisson statistical uncertainty in the data and it is typically reduced with respect to the ML estimator uncertainty.In relation to the uncertainties in Equation 21 a simple example due to R. Cousins and reported in Ref. [15]) points out their limitations.If one assumes that, for a given bin i of the distribution to be corrected, the values are C i = 0.1, β i = 0 and n i = 100, the estimate μ i,C for the expected number of events in this bin is obtained by C i n i = 10 and the associated standard deviation is C i √ n i =1.However this estimate maintains that only 10 of the 100 events that are observed in the bin are actually belonging to the bin, while the remaining 90 events migrated in from other bins.It is then contradictory to have a measurement with a 10% uncertainty when there are in fact only 10 events that are actually carrying information about the bin content.
The bias corresponding to this technique, defined as E[μ i,est ]-μ i , is estimated [13] to be b = ( where The bias b is zero only if the simulation provides a proper description of the (unknown) true distribution and the bias pulls the result towards the values derived by the model that is used to determine the correction factor.
Ultimately the values of C i depend circularly on the assumed true distribution one is trying to find.In addition the bin-to-bin correlations are completely neglected and uncertainties are only diagonal.The sum of the estimated events can be different from the sum of the observed number of events, differently from the ML estimator.The reduction in statistical uncertainty is obtained in exchange for a bias on the estimated result and the actual estimate of the bias is not simple.The bias is reduced if the migration between bins are a small fraction of the bins contents i.e. if the non-diagonal elements of the response matrix R are much much smaller than unity.Another visualization of this reduction is the requirement for the bin width to be large compared to the measurement resolution.Given its limitations in terms of possibly large biases, the technique of correction factors is a good tool for an initial approximation of the results, but it is generally advisable to avoid it for general use 3   5 Back to basics: where to from the maximum likelihood solution?
The sensitivity to fluctuations associated with the ML solution stems from the nature of equation 15 : the original Fredholm equation 1 is an intrinsically ill-posed or improper problem [10] i.e. a problem where "large and sometimes infinite changes in the solution could correspond to small changes in the input data" [16]  4 In this light the stability of the solution of Equation 15 with respect to fluctuations can be quantified by how the uncertainties on the inputs are propagated to the output: a quantitative figure of merit for this propagation is the maximum ratio of relative precision of the estimated solution μ μ μ est of Equation 15 to the relative precision of the measured input vector d = n -β β β, defined as The quantity c(R) is called the condition of the R matrix and it is the upper bound on the magnification factor for the uncertainties on the input to the inversion.A large value for c(R) implies instability under small fluctuations in the input i.e. a significant sensitivity to "noise" in the measurement.
A deeper analysis of equation 15 illustrates the link between fluctuations and instability and exposes the origin of instability in a quantitative manner [ 17] by making a connection with the condition of the matrix to be inverted.
The first step is to perform a transformation of variables in equation 15 such that the covariance matrix V d of the vector d becomes the identity matrix.In general V d can be non-diagonal as there can be correlations between the observations in the different bins: the Poisson-based likelihood for independent observations described by Equation 12is consequently extended to be and the estimates deriving from its maximization coincide with the least squares estimate 5 .The reduction of V d to the identity matrix allows to write the generalized likelihood of Equation 24 in terms of significances i.e. variables normalized to their uncertainties.The transformation of variables 3 A possible exception can be some very well behaved cases with nearly diagonal response matrices where migrations effects are minimal, the expected uncertainties are well understood and the expected bias is found to be negligible in comparison to the total final uncertainties on the unfolded results (see also Section13). 4 A simple and powerful visualization of the ill-posed problem is also given in Ref. [ 10]: given that the kernel integration in Equation 1 tends to smooth out f (y) and to reduce its high frequency components (edges, cusps and the like), the inversion of such a procedure will inevitably enhance the high frequency features of the input. 5In the limit of large expected number of events each independent Poisson variable described in Equation12 tends to a Gaussian with the same mean and variance so the resulting likelihood L will tend to the diagonal multivariate Gaussian 2 , the uncertainly on y i , and D d,i, j = 0 for i j (see chapter 4 of [18]).A non-diagonal multivariate Gaussian likelihood will include correlations.An example of correlated variables is given in the case where the total number of events is a fixed quantity and the bin contents of a histogram are correlated and are distributed according to a multinomial distribution.In the limit of large number of observed and expected events in each bin, the multivariate generalization is a multivariate Gaussian [ 18]. is a rotation in R N followed by rescaling.The matrix V d is symmetric and positive definite so there exists an zero and V d,i, j = 0 for i j.The new vector d is obtained by a rotation with Q and a rescaling based on v i as follows The new rotated and normalized d vector encapsulates the statistical significance of the inputs (i.e.their size in units of their uncertainty) : it takes into account the different statistical power of the equation associated to each of the N input values (see Equation 9) .The new R matrix is also redefined accordingly so that equation 11 is reformulated in terms of the new variables as and the sum of squares to be minimized equivalent to the maximum likelihood is simplified to The second step is to expose the decomposition of the ML solution in terms of parameters that measure the sensitivity to fluctuations in the input [10].Such parameters can also be related to the size of the migrations described by R (see Section 4 of Ref. [19]) i.e. the resolution and acceptance performance of the available instruments.This is done by performing a singular value decomposition [ 20] (SVD) of R .In general a matrix R of dimensions M × N can be decomposed as where U and V are unitary matrices ( The σ i values are called singular values of the matrix R , they are non not negative and can always be arranged in non-increasing order [ 10].Both matrices U and V can be written in terms of their column vectors: If R is replaced by its singular value decomposition in the inversion and σ j 0 ∀ j the result is The singular values σ i have important properties to characterize the unfolded result.The smoother the kernel corresponding to R (i.e. the higher order continuous partial derivatives it has), the faster the decay to zero of the singular values σ i is found to be; the smaller the value of σ i becomes, the larger the frequency turns out to be for the component σ i corresponds to (i.e. the more oscillations are present in the functions the corresponding kernel is decomposed in) [ 10].The coefficients c i = u T i d can be ordered by decreasing value and they decrease rapidly with the increasing index i [ 21].In addition the vector c = (c 1 ,..,c N ) has unitary covariance matrix V c = 1 because it is obtained by multiplying the 03002-p.10unit-covariance d by the orthogonal matrix U T .These normalized coefficients encode the significance of the corresponding contribution to the ML result.The contribution of each c i is weighted with the inverse of the corresponding singular value σ i : small singular values can generate large fluctuations in the final ML result [21].
The quantitative connection between the singular value decomposition and the magnification of uncertainties in the unfolded result can be found in the condition c(R ) : this can be re-written as and it can be shown [22] that where ||d|| is the norm of the vector d resulting from the Euclidean positive definite metric in R N .For the matrix R , the norm ||R || is induced by the Euclidean norm.If A:R N → R N is a linear application with the Euclidean norm for a vector ||x|| = ( i x 2 i ) 1 2 defined for both R N and R M , the norm of the matrix A is defined as max eigenvalue of A T A. So the condition of the matrix R can be read off from its singular value decomposition that is connected to the sensitivity to fluctuations in the unfolding problem.
The overall picture is now clearer.The singular value decomposition gives insight into the unfolding problem: ML estimators are sensitive to small effects that can lead to large changes in their values.Once the problem is described in terms of uncertainty normalized variables, the large sensitivity to small fluctuations (i.e.high frequency components, in Fourier-like language) can be derived from the high condition number c(R) for the response matrix that describes the unfolding problem.In order to pose the problem more properly, it is then necessary to reduce the the impact of the low significance, highly oscillating input components while preserving the information available in the remaining high significance, more stable components.The problem is then said to have been "regularized".As the ML estimator is unbiased according to the discussion of Section 3, regularization inevitably leads to accepting a certain level of bias in exchange for a reduced variance.The bias is defined as the difference between the expected value of the unfolded result and the true unmeasured expected value.The heart of unfolding problems lies in understanding the balance between bias and uncertainty.

Regularized unfolding: a general view
The likelihood formulation of the unfolding problem in Equations 13 and 24 quantifies the distance between the data vector n and the expectation vector ν ν ν.According to that distance, in a neighbourhood of the ML solution in R N the values of μ μ μ are such that logL(μ μ μ) ≥ logL max − ΔlogL (33) In order to filter out a certain amount of the high frequency components of the input and alleviate the sensitivity to large fluctuations, this distance definition can be modified with the goal to single out a modified solution that is still "close" to the unbiased ML estimate, but less sensitive to fluctuation.A transparent way to carry out such modification is to impose constraints on the initial likelihood by adding Lagrange multipliers and describing the regularization as a maximization procedure for a new likelihood φ.
The logarithm of the new likelihood to be minimized then becomes where L(μ μ μ) is the initial likelihood (for instance from either Equation 13or Eq. 24), S (μ μ μ) is called regularization function, α and τ are the regularization parameters that allow to tune the strength of the constraints (equivalent a special choice of ΔlogL).In addition, it is possible to add the constraint that n tot = N i=1 ν i if the solution is required to provide an unbiased estimate of the total number of events.This results in the maximization of as a function of λ and μ μ μ.It should be noted that N i=1 ν i is a function of μ i as ν i = N i=1 R i, j ν j + β i .The regularization function is often perceived as a measure of the level of "smoothness" required of the maximum likelihood solution.In this light, taking for instance the formulation of Equation 34, if α is set to zero, the solution is set to the smooth function encoding all the constrains (i.e.available pre-existing information): the shape of S (μ μ μ) is imposed as the correct one and the data are ignored.If α tends to infinity (i.e.α is much larger than any of the other coefficients) S (μ μ μ) carries no weight in the maximization and the ML solution is re-obtained.
In the explicit formalism the ingredients for the regularization of a given likelihood L(μ μ μ) are the regularization function S (μ μ μ) and a prescription for α to tune the level of filtering for the high frequency components of the input.

Regularized unfolding: the Tikhonov scheme
An analytic and quantitative measure of the smoothness of the unfolding solution is the mean square of the k th derivative proposed by Tikhonov and Arsenin in Ref. [ 23].The proposed form for the regularization function S is then with k in an integer number.If k = 2 is chosen, Equation 37 can be approximated by a sum over the numerical estimate of second derivative [24] S (μ where M is the number of values used to describe the regularization function or the number of bins used to provide its discrete description.In matrix notation it is possible to re-write S (μ μ μ) as where C is the M × M matrix that encodes the definition of the second order numerical derivatives (see Section 6 in [19]) 6 .
In the limit of large expected and observed number of events for the distribution of interest the logarithm of the likelihood to be maximized results from combining Equations 24, 34 and 33 into 6 In general a different form for C allows to use a different regularization function that is also quadratic in μ μ μ. 03002-p.12 The combined likelihood φ(μ μ μ, τ) is a quadratic function of μ given the definition of χ 2 in Equation 24and of S (μ μ μ) in Equation 38.Consequently the first partial derivatives with respect to μ and τ to be solved to minimize φ(μ, τ) return a system of linear equations.
Similarly to Section 5 the first step is a linear transformation such that the input variables d are normalized to their uncertainties and their new covariance matrix is 1 ∈ R N (diagonal with unitary elements).Consequently the value of − 1 2 χ 2 (μ μ μ) takes the form reported in Equation 28.The minimization of Equation 40 is then equivalent to finding the solution to the problem represented as which can also be re-written as As third step the product R C −1 can be expanded by a singular value decomposition like in Equation 29 and the solution of the problem can be written in terms of such expansion like in Equation 30.The major difference is the presence of the τ-dependent constraint.In order to incorporate τ in the solution of Eq. 42, a special linear transformation is performed, called Givens rotation [ 25] : this coordinate transformation sets the lower diagonal block proportional to τ to zero while transferring the information about the τ variable to the upper block.In this way the solution can be expressed as a function of τ and of the solution for τ=0 [ 19,21].The final result [21] is with φ i (τ)= The small values of σ i are now regularized by the presence of τ so that they do not cause large fluctuations.The τ parameter acts like a cut off for a low pass filter to single out the highest frequencies causing the most rapid fluctuations.When σ i is much smaller than τ the coefficient φ i (τ) behaves like σ i /τ instead of behaving like 1/σ i (see Equation 30) so φ i tends to zero instead of tending to infinity and the impact of these "frequencies" is drastically reduced.It is the additional assumption on the smoothness of the solutions that reduces the importance of the solutions that result from highly oscillating solutions.While the C matrix is set by the assumption on the derivatives, the value of τ is optimized by the properties of the problem at hand.The choice of the value for the τ parameter is discussed in detail in Sections 4.3 and 7 of Ref. [19].Here we report just the salient concepts.The reduction of the impact of higher-frequency, less statistically significant, more noise-like components is a powerful criterion for the choice of τ.The significance of each component can be read off from the coefficients of equation 43 similarly to the general discussion in Section 5. Reference [19] uses the components of the covariance-normalized, SVD-rotated input vector w defined as The components w i of w that are of order unity or less are considered to represent statistically insignificant contributions and given that the impact of the components of σ i larger than τ is suppressed according to equation 43, setting τ equal to the value of σ 2 k for k such that w k 1 is one way to 03002-p.13 meet the chosen criterion.Additional optimization for the choice of the value of τ is possible 7 , for instance by using the τ that gives that best least-squares-based comparison between a generated and the unfolded version fully simulated model for the problem under consideration.Figure 7 shows an "academic" example [19] of unfolding simulated events with this instance of Tikhonov regularization: it reports the response matrix, the superposition of the true, reconstructed and unfolded distributions, the distribution of the w i values and the difference between the true and the unfolded result.In the technical implementation of this version for Tikhonov regularization, a correction function is used to scale the simulated truth shape of the truth distribution to obtain the unfolded result.The minimization constraint is actually imposed on the curvature of this shape correction function.As a consequence the normalization information is kept in the response matrix which is no more probability-base like in Equation 4, but it is related to the actual number of simulated events so that the statistical uncertainties on the knowledge of the matrix elements are properly kept into account in the final result.

Iterative unfolding
A different approach to including information about the expected true distribution to counter the enhancement of fluctuations is obtained with an iterative approach.
As pointed out in Ref. [26], the initial idea of an iterative technique for solving ill-posed problems dates back to at least 40 years ago [28].The general description outlined by Ref. [ 28] provides a still valid basis to understand the core concepts of a large number of the iterative schemes used at present.The inspiration is derived from Bayes' theorem [ 29] where the sets involved in the formulation are named obs and true to hint respectively at the observed and true number events associated to a given property 8 .In this light Bayes' theorem can be written as where P(x|y) is the conditional probability that a variable x has a certain value given the value of the variable y is between y and y+dy, f (x) is a probability density for x and g(obs) is defined as Inverting equation 45 and using the normalization properties of P(x|y), one obtains Equation 47 looks like the "inverse" of 46: it should be noted that P(true|obs) is actually a function of f (true) itself.The proper inverse kernel for f(true) needs to be a function of P(obs|true) only.
Equation.45 provides the ansatz that if an initial hypothesis is made on f (true), it is possible to use P(obs|true) estimated from simulation to evaluate P(true|obs) by defining g(obs) as the convolution of f (true) and P(obs|true).The estimate of f (true) can be re-used as initial hypothesis for an updated estimate and the procedure can be iterated.So in the the r th iterative step, using Equation 46, g r (obs) is defined as g r (obs) = f r (true)P(obs|true)dtrue (48) Using the ansatz of Equation 47, the estimate P r (true|obs) is then convoluted with the observed g(obs) data that is estimated from data.So a new estimate for f (true), the starting point for the (r + 1) th step, is then obtained by using Equation 49 as follows: The iteration ends at the step r for which modifications to the value of f r+1 (true) introduced by additional steps are smaller than a given tolerance value.An equivalent statement is that the iterative scheme converges if there is a step r such that g(obs) data = g r (obs) [28].The integration in Equations 50 will tend to remove deviations from unity in g(obs) data /g r (obs) that are present on a large scale compared to the support of P(obs|true).On the other hand deviations on a small length scale compared to the support of P(obs|true) will be tend to be averaged out in the integration.This means that the scheme is sensitive to "long wavelength" errors in g r (obs) that are usually corrected for in the initial iterations by incorporating all the useful information on the dataset.Further iterations will take more and more into account "shorter wavelength" errors more likely deriving from statistical fluctuations in g(obs) data : the resulting corrections will tend to match g r (obs) to those fluctuations rather than to the proper g(obs) value corresponding to the best estimate of f (true).
The discretized version of the iteration technique described in Ref. [ 26] is based on the assumption that the response matrix R is positive definite and the input binned distribution y i in non negative so that the relation d =n -β β β = Rμ μ μ (from Section 3) can be inverted iteratively.The starting point is an hypothesis-guess on μ μ μ called μ μ μ 0 to produce the first estimate d 0 = R μ μ μ 0 .At the r th step of the iteration the estimate for the vector d is which is the discrete form of 48 with the correspondence R i, j →P(obs|true), μ r j → f r (true).This step can be defined "folding" and it corresponds to equation 48.Consequently the r th estimate of μ μ μ is obtained as in Equation 50 by "integrating" the response matrix R i, j over the updating function where d is the data input vector and corresponds to g(obs) data in Equation 50.The values of j represent efficiencies corrections to account for experimental acceptance losses.This step can be defined as the"unfolding" one and it corresponds to equation 50.From Equation 52, the convergence of the iteration is linked to the updating function y i /y r i approaching unity: in fact, given the normalization condition i R(i, j) = 1, y i /y r i → 1 implies μ r+1 j → μ r j .If the uncertainties are Poisson distributed such an iteration technique is empirically found to converge to the ML solution [ 30].

Iterative Unfolding: a recent example
An implementation of the general iterative approach is the technique of Ref. [ 31].The standard basic iterative steps outlined above are carried out.In this approach the updating function of equation 52 is obtained by defining the problem in terms of the relation of "causes" to"effects".The "causes" are defined as the content C i of the bins of the given, unknown true distribution distribution one wants to recover; the "effects" are the contents of the bins of the observed distribution E i .The connection between the "causes" and the "effects" is obtained by the simulation-derived response matrix P(E J |C i ), the probability that a given cause C i results into a given effect E j (coherently with to the definition of response matrix in Section 2).The losses due to limitations in the observations are represented by a common"trash" bin and need to be recovered by efficiency corrections, again estimated by simulation.It is emphasized that the variables the iteration estimates are the expected contents of the bins C i , the probability that a certain fraction of the total events are found in a given bin, not the overall probability for the distributions.So instead of writing Bayes theorem for a given distribution (spectrum) in the form of one restarts from the probability of "causes" (i.e. the bin contents) and so each "cell" (i.e.bin content) of the discretized distribution is considered an independent cause of an effect and the probability is written for the bins as With these definitions choosing P(C i |I) = p 0 = 1/M means that the probability content of a given cause i.e. a bin is a constant over the bins.This does not mean that all spectra have the same probably (it is not a statement on the probability of the distribution), on the contrary it implies a flat, precisely determined initial spectrum and consequently a very strong prior statement that would bias the result if no additional information were used: this is the starting point and the motivation for iterations.Due to this procedure the final estimate for the unfolded distribution is not expressed in terms of a given prior.
The iteration is then such that The first estimate of the number of events n(C i ) in the bin i of "causes" can be written as This can also be written as with where, in connection with Equation 52, n(C i ) is corresponds to μ 1 j , n(E j ) corresponds to d i , P(C i |E J , I) corresponds to R(i, j), P 0 (C l ) corresponds to μ 0 i and the denominator corresponds to d 1 i .The additional iterative steps follow exactly the evolution of the discrete scheme outlined in Section 8 (see the description of Equations 51 and 52).A distinctive feature of this approach is that the estimated distribution can be (and very often is) smoothed at each iteration step before the "unfolding" step by a customizable polynomial fit to a problem-specific function.In principle any smoothing technique 03002-p.17can be applied.This smoothing is not applied in the last step of the iteration.The iteration is continued until the solution is considered stable.The criterion for stability is dependent on the analysis; one suggested possibility is to quantify the agreement with the previous iteration by means of a least squares test.
An example from particle physics of the usage of the iterative unfolding technique can be found in the measurement of the distribution of the difference between the absolute rapidities (Δ|y t |) of the reconstructed top quark and anti-top quark in a sample enhanced in tt events obtained by LHC pp collisions at √ s = 7 TeV [6].In the standard model of particle physics this distribution is expected to show a slight asymmetry (at the sub-percent level) in the amount of events with positive and negative differences [6,32].The asymmetry is obtained by integrating the unfolded differential tt production cross section as Δ|y t |.The migration matrix is shown in figure 8.The measured and unfolded distribution for one specific set of events (featuring a single electron plus jets with at least one b-tagged jet) is shown in figure 8.A set of basic tests are performed to choose the number of iterations, the statistical and systematic uncertainty are propagated through the unfolding scheme and the stability and robustness of the procedure is tested.The number of iterations is such that the expected variation of the value for the asymmetry is stable within 0.1% in simulated tt events.The statistical uncertainty estimate is determined with simulated experiments and the systematic uncertainty is propagated to both the response matrix and the background.Simulated tt events are re-weighted so produce different samples with different true asymmetry.The analysis is performed on each different sample and the set input asymmetries is then plotted versus the resulting reconstructed asymmetries after unfolding to check the linearity of the unfolding procedure.The small biases observed in the reconstructed distributions and the extracted asymmetry are quantified by the largest relative deviation over all the bins and the mean uncertainty-normalized relative difference between true and unfolded values from the pull distributions, respectively.Such values are used to assign additional systematic uncertainties to the unfolded distributions and the final asymmetry.

Regularization unfolding and Entropy
Information theory can provide an alternative and deeply meaningful form for the regularization function of equation 34.Shannon's information entropy [ 33] for a given distribution is defined as: where p i is the probability for a given event/system to occur/be in a given subset i of the available phase space.The entropy H measures the amount of uncertainty represented by the probability distribution of a given variable and consequently determines the information content that any observation extracted from that population brings to the observer 9 .
When new information about a variable is acquired the gain can be quantified by the change in uncertainty (information) between the initial estimate of the probability distribution for the variable and the new one.As the entropy H measures the information change, it is at the basis of the principle of minimum relative entropy (or cross-entropy) [ 34]: if there is not enough information to specify a probability distribution uniquely, a consistent estimator for it is obtained by minimizing 9 An outcome from a distribution with a large Shannon entropy is more useful to the observer as it is less predictable than one with small entropy (which is actually fairly predictable): the observed outcome carries more information.where μ μ μ is the estimator vector for the unknown probability distribution, the index i goes from 1 to the number of M bins of the distribution and is the reference probability distribution, representing the best knowledge about the true, unknown distribution.This method is used whenever the true distribution is known to be non-negative everywhere.When the only knowledge about the true distribution is its being non-negative and the reference distribution is taken to be a constant over all bins ( i = 0 ∀i), the relative entropy of Equation 60 is reduced to the absolute entropy of Equation 59 up to a constant and the principle of minimum relative entropy is equivalent to the principle of maximum entropy [ 35].The axiomatic derivation [34] for the minimum relative entropy estimator defines it as the distribution μ i that has the minimal distance from the reference, initial estimate i in terms of information, but respects a given set of constraints.

03002-p.18
Additional insight into the use of information entropy is provided in Ref. [ 36] where the minimum relative entropy estimate is interpreted as a maximum likelihood estimate.The negative logarithm of the likelihood for a given set of binned observation n i to be compatible with a prior distribution i and to satisfy the the response matrix constraints (see Eq. 24 is considered.This likelihood is shown to be proportional to the regularization function S (μ μ μ) in equation 60 up to a constant term (see Appendix A of [36]).The likelihood for a given set of binned observation n i deriving from a true unknown distribution μ i to be compatible with a prior distribution i is represented by a multinomial distribution.The negative logarithm of this likelihood is shown to be proportional to the cross-entropy S (μμ μ) in equation 60 up to a constant term (see Appendix A of [36]).The distribution μ i is connected to the observed data by the response matrix and the likelihood for this requirement is generally represented by the generalized L(μ μ μ) of Equation 24.In the end in the estimate of μ μ μ minimizing the cross-entropy S (μ μ μ, ) with the response matrix constraint corresponds to maximizing the distribution φ(μ μ μ) in Equation 34, the negative logarithm of the full likelihood for the origin and detection of the observed events, in which the cross-entropy S (μ μ μ) is interpreted as the regularization function.In addition the interpretation of S (μ μ μ, ) as a "prior" p.d.f for μ provides the justification in a Bayesian framework [ 37].

Automatic Regularized Unfolding
An implementation of the minimum relative entropy principle to provide a the regularization function is present in the Automatic Regularized Unfolding (ARU) [ 38].This scheme is presently used to perform unfolding for one-dimensional problems.The algorithm does not require any parameter to be tuned, differently from the τ parameter for the Tikhonov scheme described in Section 7 or the number of iterations for the iterative techniques illustrated in Section 8.
ARU is a regularized fit.The unfolded distribution to be found, b(x), is parametrized as the sum of flexible and smooth piece-wise, non negative polynomial curves with finite support, b j (x), called Bsplines [39] i.e. b(x) = j c j b j (x) where the range of the index j is determined by number of non-zero derivatives and of grid points that characterize the B-splines chosen for the approximation [ 38].This solution form is folded with the detector kernel K(y, x) quantifying the miscalibrations, efficiencies and resolution effects to produce the function f (y) An extended maximum likelihood fit [40] of f (y) to the data is then performed by minimizing In this formula L 1 (c) corresponds to the negative logarithm of the overall extended likelihood function L 1 (c) = -log (L stand L norm ).The value of L stand is where K absorbs all the normalization constants, the set of y i are the observed values for the variable y, f (y i ) = f (y I |c)/v(c) and v(c) = dy f (y) = j c j F j with F j = dy f j (y).The likelihood L norm allows to include the variation of the normalization So, by using Equations 63 and 64 and the associated definitions, L 1 (c) can be written as where C = −logK + logN! includes constants that can be neglected for the purpose of minimization.L 2 (c) is the regularization term based on the relative entropy principle where the normalization B j = b j (x)dx is included.The reference distribution g(x) is chosen to be uniform while the weight w is determined by minimizing the mean integrated squared error (MIS E) on f (y) (that includes an estimate of the bias) An example of the performance of the technique is shown in figure 9. Unfolding is performed on a sample of one thousand events drawn from a distribution of two Gaussian convoluted with a Gaussian kernel.The regularized result is compared with the unregularized solution.Figure 10 shows the distributions obtained when performing 2000 simulated experiments with 100 or 10000 events in each experiment, respectively.The uncertainty estimate from ARU is consistent with the observed standard deviation and the average bias has the same size of the statistical uncertainty.A simulation study using 1000 pseudo-experiments of 100 and 1000 events each shows the distribution for the mean and the standard deviation of the unfolded distributions with a bias that is comparable with the statistical uncertainty of the solution.

Non-iterative Bayesian-inspired regularization
A non-iterative regularization scheme also inspired by Bayes' theorem was recently proposed [ 41].
The rationale is to find the probability for the spectrum of a variable as a whole, given the probability for the observed data spectrum and the migration model, according to Bayes' formula: p(T|D ∧ P) ∝ P(D|T ∧ P)π(T ∧ P) where T = (T 1 , T N t ) is the truth level binned spectrum (event density) in an N t -dimensional space, D = (D 1 , D N r ) is the observed binned spectrum in a generally different N r -dimensional space.D r is assumed to follow a Poisson distribution of mean R r where R = (R 1 , R N r ) is the N r dimensional spectrum of the expected observations.The matrix P is the conditional migration matrix whose element P t,r is the conditional probability P(r|t) for an event produced in the truth level bin t to be reconstructed in the reconstructed bin r.So P(r|t) is defined as where M t,r = P(t, r) is the joint probability for an event to be produced in the truth level bin t and in the reconstructed level bin r and t is the efficiency for reconstructing events in the bins of row t defined as The interpretation of Equation 68 is that the resulting p(T|D ∧ P) is the posterior probability density function (p.d.f.) for the truth level binned spectrum T, P(D|T ∧ P) is the likelihood of the observed binned spectrum D as a function of T and P and π(T ∧ P) is the prior p.d.f of T and P.
If the spectrum is such that the data are counts of events, the Poisson distribution can be used where the mean expected number of events R r in a bin r of D is and B r is the expected number of background in bin r.
The result of the unfolding is the posterior probability distribution function p(T|D) for the whole spectrum.The form of the posterior distribution in Equation 68 is the same as the regularized likelihood that generates Equation 35 if L(μ μ μ) is identified with p(T|D) and if a function S (T) is defined such that the prior distribution is written as and S (T) is identified with S (μ μ μ).Regularization is then interpreted as the inclusion of the prior distribution i.e. the a priori degree of belief in a specified property of the "true" spectrum T. The estimator T t for the content of bin t of the unfolded spectrum can then be derived by using more than one algorithm: from taking the mode or the mean of p t (T l |D) to considering the half point of the 68% integration interval to using the mean of the Gaussian fitted to the p t (T l |D).The uncertainty associated with the estimator is usually defined as the shorted interval in the range of T t for which the integral of p t (T l |D) amounts to 0.68.A crucial item for this technique is then the study of the convergence, stability and speed for the integration to be performed in Equation 74 [41].Figure 11 shows an example of marginalized posterior probability distribution for the content of one bin of a given"true spectrum" resulting from a simulated experiment [ 41].A steeply falling distribution of a given variable m is convoluted with an m-dependent Gaussian distribution aimed at simulating detector effects.The posterior distribution and some associated estimators are shown.

Unfolding schemes: additional examples
Other unfolding schemes tend to build on the basic techniques outlined above.The following examples are by no means an exhaustive list: its goal is to show that the problem of unfolding can be approached in a variety of manners that involve the evolution of old ideas and their merging with new schemes.Unfolding solutions vary depending on the problem at hand and unfolding schemes are used in science beyond the realms of nuclear and particle physics .These lectures should be considered a starting point to enlarge one's knowledge about the problem and an invitation to explore the available techniques and find or develop the techniques that is more suitable for the problem the reader is interested in solving.
In the realm of iterative techniques the Iterative Dinamically Stabilized (IDS) method [ 42] regularizes the statistical fluctuations by iterative steps in which simultaneous corrections to the normalization of the distribution and its shape are derived.Each iteration involves three stages.Initially a correction to the normalization is provided by weighting the difference between the data and the simulation (Δd) with a monotonic regularization function f (Δd) that depends on the statistical significance of the absolute Δd.The data-simulation difference for the content in each bin k, Δd k , is weighted with f (Δd k ) and the "true-to-reco" migration probability estimated from simulation.This difference is then used to correct the content of bin k in the simulation of the true distribution.The second step is an improved estimate of the background subtraction and the third step is an improved estimate of the response (migration) matrix.In both steps the bin content difference between the updated unfolded result and the true simulated distribution is weighted with f (Δd k ) (and additionally with the simulation-derived"reco-to-true" migration probability in the third step) to provide corrections to improve background treatment and migration modelling.The goal is to derive unfolding corrections that tend to preserve real new structure present in the data (but not in the simulation) while reducing the statistical fluctuations and biases due to background subtraction.An example of application of the IDS technique in particle physics is the measurement of inclusive jet and dijet production in LHC 03002-p.24 pp collisions at √ s = 7 TeV using the ATLAS detector [43] where the steeply falling distribution of the transverse momentum (p T ) for jets of colourless particles is extracted by correcting the p T distributions for jets formed by the energy depositions in the ATLAS detector.
Iterative schemes are also provided in variants that do not require binning events into histograms.Binning free methods avoid the problem of bin size optimization.They allow to transform from one variable into another and apply selection criteria, after unfolding; small size samples can be used in arbitrarily high dimensions where this becomes a problem for techniques relying on histograms.Finally the unfolded samples automatically provide the estimate of the statistical uncertainty associated to the measurement.
A first example is the Binning Free Deconvolution proposed in Ref. [ 44].For each iteration two stages are proposed.The first stage is to correct the weight of each single simulated event in the distribution of the variable of interest with the ratio of the experimental local density of events to the simulated one, both estimated by counting the events corresponding to an interval around the event value for the variable under consideration.The size of the interval of values is of the order of the experimental resolution.The second stage reassigns a weight to each event by averaging, for each event, the weights of the nearest neighbours in a given region.The size of the averaging region is variable and it governs the level of smoothing in the unfolding.Events are regrouped in bins after each iteration step.The iteration stops at the step after the minimum is reached for the regularized sum of squares defined as follows: a sum of squared differences between the bin content of the data and each intermediate corrected simulated distribution is added to a Tikhonov regularization term based on a second derivative estimate (like in equation 40).
The second example of binning free unfolding scheme is the Satellite Method [ 45], a technique aiming at obtaining the deconvoluted distribution associated to a certain data set in (generally) a multivariable space.Given the observed data sample a simulated sample with exactly the same number of the events is generated, representing the unfolded, true "positions" in the multi-dimensional space.Each true "position" is associated to a set of "observed" positions representing the detector effects on the "true" position (de facto sampling the response kernel): the relative position of these "satellites" with respect to the associated "true" position is kept fixed.The number of "satellite" realizations is a tunable variable of the method.In each step of the iteration a test variable is computed to assess the quantitative compatibility of the experimental distribution and the expected density for simulated events resulting from the "satellites".The advised quantity is the statistical energy defined with the same formalism of the potential energy of two opposite charge distributions: the distributions are derived from both the experimental data and the expected simulated density and the 1/r-law to be integrated is replaced by a general positive definite function (an exponential or a logarithm) of the distance between any two elements of the two samples.The value of the test variable is calculated at the beginning of the step and at the end, after having randomly chosen one of the "true" positions and after having moved it in the multi-variate space in a random direction, together with its satellites.The new expected density is recomputed and the test variable is recalculated: the migration is kept if the test variable has achieved a smaller value with respect to the beginning of the step, otherwise it is rejected.A new iteration starts if the test variable has not reached its minimum value.Regularization is implemented by either stopping the iteration before the results start to oscillate significantly or by varying the number of "satellite" points corresponding to a given "true" point: a larger number of satellite points implies a stronger regularization as the simulated information plays a more important role.
Finally s Plot [46] is a statistical technique aimed at reconstructing the unknown true distribution of a variable (control variable) for a given data set by knowing the distribution of other variables associated to the various sources of events that compose the sample (discriminating variables).The 03002-p.25 distribution of the control variable might be known for some sources of the events, but not for all.A crucial assumption is that the control variable is uncorrelated with the discriminating variables.The first step is to perform an extended maximum likelihood fit of the probability distributions for the discriminating variables to the various sources of events in the data.This fit returns the yields of the sources and determines the parameters of of the distributions.The distribution for the control variable in a given data subset (called s Plot for the given data subset) is then obtained by weighting all the events in the subset with a function of the distribution of the discriminating variables and their estimated covariance matrix (called sWeight).The sWeight is derived by solving a matrix equation in which the average of the control variable s Plot for the given data subset is expressed as a linear combination of the control variable special distributions ( in Plots) for all the subsets of the data and the coefficients are provided by the covariance matrix of the discriminating variables.The in Plot for a given data subset is the control variable distribution obtained by expressing the control variable as a function of the discriminating variables.The in Plot for a given data subset is obtained by weighting all the data events in a given control variable bin with the probability that each event belongs to the data subset of interest: the value of the discriminating variable for each event is fed to the known discriminating variable distribution to obtain such probability.
12 Unfolding software tools Some of the most recent unfolding schemes used in particle physics are now available as updated and maintained public software tools.These make it much simpler to use the proposed techniques and to provide feed-back that can be included in new versions, with general benefit to the users' community.
A sizeable collection of the available public software (and documentation) to perform unfolding in particle physics is reported under the Unfolding Framework Project [ 50].
Amongst the available tools, the ROOT Unfolding framework (RooUnfold) [ 51] usefully collects five unfolding schemes, including a basic implementation of unregularized ML unfolding through matrix inversion (see Section 3), the correction factor scheme outlined in Section 4, the regularized Tikhonov approach using the second order derivative described in Section 7 and a version of the iterative Bayesian-inspired scheme illustrated in Section 8.1.RooUnfold is a coherent C++ framework that can be linked against the ROOT libraries [52] that are widely used in particle physics analysis.
The most updated version of the iterative scheme of Section 8.1 is available at [53].
A precursor for the implementation of the Tikhonov scheme outlined in Section 7, coupling the second derivative regularization with a B-spline-based parametrization, is available in the FORTRANbased program RUN [55].The actual FORTRAN implementation of the Tikhonov scheme of Section 7 is available in the program GURU [56] and a C++ wrapper is also available [57] outside of ROOT.
The ARU scheme described in Section 9.1 is available in C++ format with a Python-based interface [58].

Optimization and choice of unfolding technique: the last two cents
Given an unfolding problem, the optimization of the balance between the bias and the total expected uncertainty, including full statistical and systematic effects, is a powerful criterion against which to scan the available degrees of freedom: unfolding scheme, binning of data, regularization parameter.Such optimization is usefully developed on large samples of simulated data so as to avoid biases induced by the specific features of the data under consideration.Quantitative figures of merit can be derived from the available information to drive the optimization [ 13,47].In the very common case of binned data, a basic figure of merit related to the total uncertainty is the mean squared averaged uncertainty over the bins of the distribution, defined as where U i,i is the diagonal element of the covariance matrix of the unfolded estimates for the ith bin content and b i is the estimated bias on each bin content as defined at the end of in section 5 and M is the number of bins in the distribution.In general different bins have different uncertainties.A modified MS E can take this into account by weighting the contribution of each bin with the inverse of the bin variance and by remembering that the mean and the variance of the Poisson-distributed content of bin i ,μ i , are equal: Another possible driving criterion is to require that the estimated bias squared, b 2 i , be smaller than its own variance V ii = cov[b i , b i ] so that there is no statistically significant bias.In this way the prescription can be : If the bias were statistically significant a correction for it could be considered and the uncertainty on the bias would then have to be included as systematic uncertainty on the final result.The inclusion of systematic uncertainties in the measurement needs to take into account the correlations that the unfolding introduces on a bin-by-bin basis and to derive their combined impact on the measurement.As an example, a very powerful way to achieve this is to use pseudo-experiments 10or toy experiments 11 : each experimental variable derived in a given pseudo-or toy experiment is the result of the sum of the effects of modifying a set of (assumed) independent modelling parameters 12 .The distributions of the modelling parameters are derived from some ancillary measurement or from an explicitly mentioned assumed distribution (usually a Gaussian or a Log-Normal distribution) and the effects are obtained by sampling such independent distributions.By calculating the observable(s) of interest in each experiment, its (their) distribution(s) can be calculated and, for instance, an estimator(estimators) for the mean(s) and the variance(s) can be obtained from the calculated distribution(s).The correlation between the various bin content can also be estimated by taking averages and variances over the pseudo-or toy experiments.This is a essentially Bayesan-inspired way of marginalizing the modelling parameters in the likelihood through Monte Carlo integration [ 59,60].An example of the formalism for the inclusion of the modelling parameters can be found in Ref [ 61] (see section 2 and Appendix A).The unfolding ingredients where the variations of the parameters need to be implemented are easily read off from the general likelihood solution and its elements shown in Equations 11, 13 and 36: the response matrix, the estimate of non interesting events (backgrounds), the acceptance corrections all need to be varied when input to the unfolding procedure according to their relation with the independent modelling parameters.In this way the all the correlations introduced by the unfolding are automatically kept into account.
In relation to the optimization, the variables that provide crucial insight in the origin of the unfolding problem represent sensitive tools to understand the role of the balance between statistical and systematic uncertainties.An important example is the condition number c(R) defined in Equation 23: this quantifies the general sensitivity of the analysis to incoming fluctuations, independently of their origin.When exploring the unregularized maximum likelihood solution, the condition number will change depending on the binning of the events and on the inclusion of certain systematic uncertainties.In the regularized realm an important example is the choice of τ which is based on the distribution of the input vector w i defined in Equation 44Such distribution is driven by the combination of information of the response matrix and the statistical covariance matrix of the inputs linked by the specific curvature-based regularization (as illustrated in Section 7): in general systematic uncertainties are not included in the input covariance matrix, so the optimal values for the τ parameter chosen according to the criteria mentioned in Section 7 need to be extended to reflect those effects, by considering the expected variations of w i due to the different systematic effects.
The choice for the algorithm and the optimization of its parameters is definitely dependent on the data analysis that is being carried out.Whatever the technique chosen for unfolding, it is useful to produce (and possibly report) the unregularized maximum likelihood solution using the matrix inversion solution of Equation 11: the unfolding technique does not add any bias to the result (see section 3) and it is equivalent to using the uncorrected data to test a theory or estimating its parameters, for instance by minimizing the sum of squares [ 47] χ 2 (θ θ θ) = (μ μ μ(θ θ θ) − μ μ μ ML ) T U −1 (μ μ μ(θ θ θ) − μ μ μ ML ) where U −1 is the covariance matrix on the unfolded measurement and θθ θ is the vector of parameters to be estimated by the minimization.The important point is that the full covariance matrix for the unfolded result needs to be used.Finally it is crucial to devise a test for stability and bias to check whether the unfolding scheme is robust with respect to possible fluctuations and what level of bias can be recovered by the unfolding.This is particularly important to build confidence that the unfolding is not diluting or cancelling important unexpected features of the data that are not foreseen by the models underlying the unfolding scheme.Typical tests are performed by unfolding simulated events whose true shapes are distorted with respect to the best knowledge encoded in the theoretical models included in the simulations.The induced distortions should be comparable with the total statistical and systematic uncertainty of the measurement.The distorted simulated test distributions are then unfolded and the deviation with respect to the input can be quantified, for instance, by using a least squares test between the input model and the unfolded result and exploiting the full covariance matrix including both statistical and systematic uncertainties.This provides a quantitative assessment of the capacity of the unfolding scheme to recover the input distortion and its ability to maintain "unforeseen" features present in the data.
Extremely useful insight into the unfolding problem is given in the lectures of Ref. [ 47] and Ref. [11], the chapter on introduction to statistics in Ref. [48] and the slides and proceedings of the PHYSTAT 2011 conference dedicated to unfolding [ 49].

EPJ 55 Figure 1 .
Figure1.Scheme for the evolution of the medical imaging process using figures from Ref.[1].The simulated photon pair emission density representing the brain (left, Figure2) is passed to a simulation of the Positron Emission Tomography (center, Figure1a) that produces the "observed" count distribution from the photon detector (right, Figure3a).The names of the figures are as they appear in the reference.

Figure 3 .
Figure 3. Scheme of evolution of the measurement of the invariant mass of the top-antitop quark system.The predicted mass distribution (left, Figure (10, right) in Ref. [5] (shown with the inclusion of theoretical uncertainties) for events featuring top-antitop quarks produced in √ s = 7 TeV pp collisions at the LHC is reconstructed (right, Figure 1(a) from Ref. [6] ) after the top quark decay products are measured by the ATLAS detector (middle, image from Ref. [7] ).A diagram from Ref. [8] shows the final state partons from the top quark decay at leading order.The names of the figures are those by which they appear in the references.

Figure 7 .
Figure 7.All the ingredients and results of an unfolding problem resolved with the n = 2 Tikhonov scheme from Ref. [19] (a) The response matrix connecting the true distribution to the measured one by encapsulating the model for the detection performance (b) The superposition of the true distribution (solid curve, labelled x test ), the measured distribution (dotted curve, labelled b) and the unfolded distribution (dots, labelled x tau with the choice of τ = s 2 10 , the tenth singular value c) The superposition of three versions for the absolute value of the covariance normalized, SVD-rotated input vector called: the unregularized values (solid, labelled |d i |), the regularized values(dashed, labelled d τ i , for τ =s 2 10 ) corresponding to equation 44 w, the arrow points to the boundary between significant and non-significant components of the unfolded solution.(d) the difference between the true distribution (x test ) and the unfolded result (x τ ) showing the one and two standard deviation statistical bands of the true distribution.

Figure 8 .
Figure 8.(a) Reconstructed distribution of the difference between the absolute rapidities of top quark and antitop quark (Δ |y|) in top quark pair events observed by the ATLAS detector in pp collisions at √ s = 7 TeV at the LHC.The observed data are represented by the dots, the predicted amount of events and their breakdown in different sources are shown in the histograms in different colours and illustrated in the legend.(b) Migration matrix from simulated top quark pair events.(c) Unfolded differential cross section for the production of top quark pair events as a function of Δ |y| (dots) compared with the prediction from the standard model (red histogram).All the plots are taken from reference [6].

Figure 9 . 3 Figure 10 .
Figure 9. ARU-unfolded distribution of a one-dimensional variable x in a simulated experiment using a dataset of 1000 events [38].The true distribution t(x) is "smeared" into the histogram corresponding to the data points.In this case the folded distribution f (y) used in equation 65 and the reference distribution g(x) used for the regularization in Equation 66 are on top of each other.The regularized solution b(x) is not showing undesired oscillations, differently from the unregularized solution b w=0 (x).

Figure 11 .
Figure 11.Example of one-dimensional marginal distribution for the content of one bin, p t (T l |D), derived from a simulated model (see Fig 13 and section 6.4 of [41]) (yellow-filled distribution).In this specific case l=11 as it is the 11th bin (out of 14) of the overall distribution.The red cross marker represents the input truth bin content T 11 .The black circle marker shows the simulated observed content D 11 .The blue square marker represents the most likely value of T 11 and the blue dashed line represents the shortest interval the integrates 68% of the marginalized distribution.The green dotted line shows the range in T 11 that is used to sample the posterior p(T|D to calculate the integral in equation 74. The posterior distribution p(T|D) is used to compute the marginal posterior distributions of the content in the bins of the spectrum p t (T t |D) for t ∈ [1, ...N t ].Such distributions are defined as p t (T l |D) = .. p(T|D)dT 1 ..dT l−1 dT l+1 ..dT N t (74)