Partial wave analysis with OpenAcc

Partial wave analysis (PWA) is an important tool in hadron physics. Large data sets from experiments at the high-precision frontier require high computational power. To utilize GPU clusters and supercomputers with various types of accelerators, we implement OpenAccPWA, a software framework for partial wave analysis using OpenACC. OpenAccPWA provides convenient ways to expose parallelism in the code and supports the large body of existing CPU-based code for partial wave amplitudes, which avoids the heavy workload of migrating code from CPU to GPU. This proceeding briefly introduces the software framework and the performance of OpenAccPWA.

* Corresponding author: xiaoyanjia@ihep.ac.cn
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).
EPJ Web of Conferences 245, 06040 (2020), CHEP 2019, https://doi.org/10.1051/epjconf/202024506040


Partial wave analysis (PWA)
The generally accepted theory of the strong interaction, quantum chromodynamics (QCD), remains a challenging part of the standard model in the low energy regime. Hadron spectroscopy provides validation of, and valuable input to, the quantitative understanding of QCD. Partial wave analysis (PWA) is an important tool in hadron spectroscopy.
In PWA, the full kinematic information is used and fitted to a model of the amplitude in a partial wave decomposition. The spin-parity, mass, width and decay properties of the resonances are accurately measured [1]. The most common approach to partial wave analysis in modern experiments is the event-by-event maximum likelihood fit. The fit searches for the maximum of the logarithm of the likelihood, which corresponds to the best set of parameters for the model used.
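For example, the two-body decay amplitudes in the sequential decay process $J/\psi \to N_x\gamma$, $N_x \to K^+K^-$ are constructed using the relativistic covariant tensor amplitude formalism [2]. In $J/\psi \to N_x\gamma$, $N_x \to K^+K^-$, $A_j$ is the $j$th partial wave amplitude, which is described as

$$A_j = A^{j}_{\mathrm{prod}-X}\,(BW)_X\,A_{\mathrm{decay}-X} \qquad (1)$$

where $A^{j}_{\mathrm{prod}-X}$ is the amplitude describing the production of the intermediate resonance $N_x$, $(BW)_X$ is the Breit-Wigner propagator of $N_x$, and $A_{\mathrm{decay}-X}$ is the decay amplitude of $N_x$. The total differential cross section $d\sigma/d\Phi$ is

$$\frac{d\sigma}{d\Phi} = \Big|\sum_j c_j A_j + F_{\mathrm{phsp}}\Big|^2 \qquad (2)$$

where $F_{\mathrm{phsp}}$ denotes the nonresonant contribution described by an interfering phase space term.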
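As a minimal sketch of Eq. (2), the intensity of one event can be computed from the coefficients and amplitudes as follows. The function name `intensity`, the container layout, and the treatment of each amplitude as a single complex number per event (ignoring any sum over polarization states) are simplifying assumptions made here for illustration, not part of the framework.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Intensity of a single event, |sum_j c_j A_j + F_phsp|^2, following Eq. (2).
// c      : complex coefficients, one per partial wave (hypothetical layout)
// a      : partial wave amplitudes evaluated for this event
// f_phsp : interfering phase space term for this event
double intensity(const std::vector<cplx>& c, const std::vector<cplx>& a, cplx f_phsp) {
    assert(c.size() == a.size());
    cplx total = f_phsp;
    for (std::size_t j = 0; j < c.size(); ++j) {
        total += c[j] * a[j];  // coherent sum: the waves interfere
    }
    return std::norm(total);  // std::norm returns the squared magnitude
}
```

Because the sum is taken before squaring, two waves with opposite phase can cancel, which is the interference that the fit exploits.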
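The probability to observe the event characterized by the measurement $\xi$ is

$$P(\xi) = \frac{\omega(\xi)\,\epsilon(\xi)}{\int d\xi\,\omega(\xi)\,\epsilon(\xi)}$$

where $\omega(\xi) \equiv d\sigma/d\Phi$ and $\epsilon(\xi)$ is the detection efficiency. $\int d\xi\,\omega(\xi)\,\epsilon(\xi)$ is the normalization integral calculated from the exclusive Monte Carlo (MC) sample. The joint probability density for observing $N$ events in the data sample is $\mathcal{L} = \prod_{i=1}^{N} P(\xi_i)$. For a given data set, $\epsilon(\xi_i)$ is a constant and has no impact on the determination of the parameters of the amplitudes. So, the log-likelihood function to be maximized is

$$\ln\mathcal{L} = \sum_{i=1}^{N} \ln\frac{\omega(\xi_i)}{\int d\xi\,\omega(\xi)\,\epsilon(\xi)}$$

According to the general form of the decay amplitude, $\omega$ can be written in terms of the fit parameters and a matrix element built from the partial wave amplitudes. The expression of the partial wave amplitude varies with the decay channel.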
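The likelihood described above can be sketched in code as follows. This is an illustration under stated assumptions, not the OpenAccPWA implementation: the function name `negLogLikelihood` is hypothetical, the normalization integral is approximated by the mean intensity over accepted MC events, and the sign is flipped because minimizers conventionally minimize $-\ln\mathcal{L}$. The OpenACC directives are hints; a compiler without OpenACC support ignores them and runs the loops serially.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// omega_data[i]: intensity of data event i
// omega_mc[k]  : intensity of accepted MC event k (assumed phase-space MC)
double negLogLikelihood(const std::vector<double>& omega_data,
                        const std::vector<double>& omega_mc) {
    assert(!omega_data.empty() && !omega_mc.empty());
    const double* d = omega_data.data();
    const double* m = omega_mc.data();
    const long n_data = static_cast<long>(omega_data.size());
    const long n_mc = static_cast<long>(omega_mc.size());

    // Normalization integral, estimated as the MC average of the intensity.
    double mc_sum = 0.0;
    #pragma acc parallel loop reduction(+:mc_sum) copyin(m[0:n_mc])
    for (long k = 0; k < n_mc; ++k) mc_sum += m[k];
    const double norm = mc_sum / static_cast<double>(n_mc);

    // Events are independent, so the sum of logarithms is a parallel reduction.
    double log_sum = 0.0;
    #pragma acc parallel loop reduction(+:log_sum) copyin(d[0:n_data])
    for (long i = 0; i < n_data; ++i) log_sum += std::log(d[i] / norm);

    return -log_sum;
}
```

The constant efficiency term is omitted, consistent with the statement above that it does not affect the fitted parameters.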

GPUPWA at BESIII
The Beijing Spectrometer III (BES-III) is an important particle physics experiment at the Beijing Electron-Positron Collider II (BEPC-II) at the Institute of High Energy Physics (IHEP). The pioneering use of GPU parallel acceleration in PWA was carried out in the framework of BES-III [4]. BES-III developed the GPUPWA software framework based on OpenCL. GPUPWA is written in C++, and its fitting and plotting functions are implemented with ROOT [5]. IHEP has established a GPU high-performance computing cluster.
On an Intel Core 2 Quad 2.4 GHz workstation with 2 GB RAM and an ATI Radeon 4870 GPU with 512 MB RAM, a $J/\psi \to \gamma K^+K^-$ analysis with four partial waves runs more than 100 times faster than the reference FORTRAN implementation for sufficiently large numbers of events.

Introduction to OpenACC
The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator device, providing portability across operating systems, host CPUs and accelerators.
At its core OpenACC supports offloading of both computation and data from a host device to an accelerator device. In the case that the two devices have different memories the OpenACC compiler and runtime will analyze the code and handle any accelerator memory management and the transfer of data between host and device memory.

Fig. 1. OpenACC's Abstract Accelerator Model
OpenACC uses high-level compiler directives to expose parallelism in the code, and parallelizing compilers to build the code for a variety of parallel accelerators. Programmers provide simple hints to the compiler that identify which regions of code to accelerate, without having to modify or adapt the underlying code itself. This improves code portability.
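A generic illustration of such a hint is given here; it is not taken from OpenAccPWA, and the function name `scaleAndAdd` is hypothetical. The loop body is ordinary C++, and the single `#pragma acc` line asks the compiler to offload the loop and to move the two arrays. A compiler without OpenACC support ignores the directive and produces the serial loop.

```cpp
#include <cassert>
#include <vector>

// Computes y[i] += a * x[i] for all i.
void scaleAndAdd(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* xp = x.data();
    double* yp = y.data();
    const long n = static_cast<long>(x.size());

    // copyin: x is only read on the device; copy: y is read and written back.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (long i = 0; i < n; ++i) {
        yp[i] += a * xp[i];
    }
}
```

Removing the directive leaves a valid serial program, which is why existing CPU code can be kept largely unchanged.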

Fig. 2. Compiler hint
Through this high-level programming model, an OpenACC compiler can generate parallel code for different platforms, so an application written with OpenACC is highly portable across platforms. This makes supercomputing resources more convenient to use. In addition, the covariant tensor amplitudes of baryon spectroscopy are very complicated, and the corresponding code is difficult to port to GPUPWA (OpenCL).

OpenAccPWA
To utilize GPU clusters and supercomputers with various types of accelerators, we implement OpenAccPWA, a software framework for partial wave analysis that uses OpenACC and is based on GPUPWA.
In OpenAccPWA, users first provide the calculation formulas of the amplitudes, the data and MC data sets, and the initial parameters according to their own physics analysis. OpenAccPWA can keep the original C++ code for the partial wave amplitude calculation.

Fig. 3. Two partial wave amplitudes from the decay channel $J/\psi \to \gamma K^+K^-$ as OpenAccPWA code

Multiple iterations are then needed to complete the maximum likelihood fit. The input parameters of each iteration depend on the results of the previous iteration, so the iterations run serially. Because events are independent of each other, parallel computing can be used to speed up the calculation of the likelihood function. In addition, the matrix element depends only on the four-momenta in some cases, so calculating it once and reusing it in every iteration improves performance effectively.
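The caching idea can be sketched as follows, under stated assumptions: the matrix element is taken to be the product $A_j A_k^*$ per event, the phase space term is left out, and the names `AmplitudeTable`, `buildTable` and `intensities` are hypothetical. The table is filled once from the event kinematics; each fit iteration then only combines it with the current coefficients, in parallel over events.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// T[i][j][k] = A_j(i) * conj(A_k(i)), stored as flat real and imaginary arrays.
struct AmplitudeTable {
    long n_events;
    long n_waves;
    std::vector<double> re, im;  // size n_events * n_waves * n_waves
};

// Filled once before the fit; amps[i * n_waves + j] is A_j for event i.
AmplitudeTable buildTable(const std::vector<cplx>& amps, long n_events, long n_waves) {
    assert(static_cast<long>(amps.size()) == n_events * n_waves);
    const std::size_t total = static_cast<std::size_t>(n_events * n_waves * n_waves);
    AmplitudeTable t{n_events, n_waves, std::vector<double>(total), std::vector<double>(total)};
    for (long i = 0; i < n_events; ++i)
        for (long j = 0; j < n_waves; ++j)
            for (long k = 0; k < n_waves; ++k) {
                const cplx p = amps[i * n_waves + j] * std::conj(amps[i * n_waves + k]);
                t.re[(i * n_waves + j) * n_waves + k] = p.real();
                t.im[(i * n_waves + j) * n_waves + k] = p.imag();
            }
    return t;
}

// Called in every iteration: omega_i = sum_{j,k} Re(c_j conj(c_k) T[i][j][k]).
std::vector<double> intensities(const AmplitudeTable& t, const std::vector<cplx>& c) {
    assert(static_cast<long>(c.size()) == t.n_waves);
    const long nw = t.n_waves;
    const long nn = nw * nw;
    std::vector<double> cre(nn), cim(nn);  // small table of c_j * conj(c_k)
    for (long j = 0; j < nw; ++j)
        for (long k = 0; k < nw; ++k) {
            const cplx q = c[j] * std::conj(c[k]);
            cre[j * nw + k] = q.real();
            cim[j * nw + k] = q.imag();
        }
    std::vector<double> omega(t.n_events);
    const double* tre = t.re.data();
    const double* tim = t.im.data();
    const double* pre = cre.data();
    const double* pim = cim.data();
    double* out = omega.data();
    const long n = t.n_events;
    #pragma acc parallel loop copyin(tre[0:n*nn], tim[0:n*nn], pre[0:nn], pim[0:nn]) copyout(out[0:n])
    for (long i = 0; i < n; ++i) {
        double w = 0.0;
        for (long jk = 0; jk < nn; ++jk)
            w += pre[jk] * tre[i * nn + jk] - pim[jk] * tim[i * nn + jk];
        out[i] = w;
    }
    return omega;
}
```

The double sum equals $|\sum_j c_j A_j|^2$, so the result matches a direct evaluation while the amplitude code itself runs only once per event.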
Finally, to find the true optimum rather than a local one, the fit has to be repeated many times with new input.
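One way to organize such repeated fits is sketched here. Everything in it is an assumption for illustration: the `FitResult` type, the `bestOfRepeatedFits` function, the uniform range of the random starting values, and the idea that a single fit is available as a callable. The source does not describe how OpenAccPWA chooses new inputs.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <random>
#include <vector>

struct FitResult {
    double nll;                  // negative log-likelihood at the minimum found
    std::vector<double> params;  // fitted parameter values
};

// A single fit: takes initial parameters, returns the minimum it converged to.
using FitFunction = std::function<FitResult(const std::vector<double>&)>;

// Runs the fit n_repeats times from random starting points and keeps the
// result with the lowest negative log-likelihood.
FitResult bestOfRepeatedFits(const FitFunction& fit, int n_params, int n_repeats,
                             unsigned seed) {
    assert(n_params > 0 && n_repeats > 0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> start(-1.0, 1.0);  // assumed range
    FitResult best{std::numeric_limits<double>::infinity(), {}};
    for (int r = 0; r < n_repeats; ++r) {
        std::vector<double> initial(n_params);
        for (double& p : initial) p = start(rng);
        FitResult result = fit(initial);
        if (result.nll < best.nll) best = result;
    }
    return best;
}
```

Each fit is internally serial, but separate fits are independent of one another, so they could in principle be distributed over several devices or nodes.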

Performance
The test ran a $J/\psi \to \gamma K^+K^-$ analysis with two partial waves on a machine with a 2×8-core Intel Xeon(R) CPU and an NVIDIA Tesla K80(8) GPU. Several execution schemes were compared: CPU, CPU (4 threads), multicore CPU (16 cores) and GPU.

Fig. 4. Performance of OpenAccPWA
OpenAccPWA on the GPU runs about 50 to 75 times faster than on the CPU. The speedup becomes more significant as the number of events increases. With low statistics, using the multicore CPU as the accelerator is faster than using the GPU. The program speed is limited by data transfers between the CPU and GPU and by GPU memory accesses.
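A common OpenACC technique for reducing transfer cost is a structured data region that keeps the event data on the device across iterations. The sketch below is a generic illustration, not a description of what OpenAccPWA does; the function name `logSumsPerIteration` and the use of a scale factor as a stand-in for updated fit parameters are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// For each scale s (standing in for one fit iteration), returns
// sum_i log(s * omega[i]). The event array is copied to the device once.
std::vector<double> logSumsPerIteration(const std::vector<double>& omega,
                                        const std::vector<double>& scales) {
    assert(!omega.empty());
    const double* w = omega.data();
    const long n = static_cast<long>(omega.size());
    std::vector<double> sums;
    sums.reserve(scales.size());

    #pragma acc data copyin(w[0:n])  // one host-to-device transfer for all iterations
    {
        for (double s : scales) {  // iterations stay serial on the host
            double total = 0.0;
            // present: reuse the device copy instead of transferring again
            #pragma acc parallel loop reduction(+:total) present(w[0:n])
            for (long i = 0; i < n; ++i) total += std::log(s * w[i]);
            sums.push_back(total);
        }
    }
    return sums;
}
```

With this layout only the scalar result crosses the bus in each iteration, so the fixed transfer cost is amortized over the whole fit.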
It can be expected that OpenAccPWA on the GPU will perform better still in computationally more expensive PWA.