High-Performance Optimization of Algorithms Used in the BM@N Experiment of the NICA Project

. Results of high-performance optimization of BmnRoot software modules are presented. The BmnRoot package used in the BM@N experiment of the NICA project plays a crucial role in the simulation and event reconstruction so its performance should be maximized to make the data processing e ﬃ cient. Results of performance analysis on representative testcases are given and bottlenecks are localized. Most suitable approaches to BmnRoot optimization are chosen and numerical estimates of the scalability of the parallelized modules for event reconstruction are presented.


Introduction
The BmnRoot software package is used in the BM@N experiment [1] of the NICA project to solve a great many different tasks. The main problems to be solved and the logical structure of the package are shown in Fig. 1. Both simulation and track reconstruction problems may be solved by running multiple independent modules with different typical times of execution ( Fig. 2) [2]. For example, the event reconstruction may take as long as several seconds per event, depending on the type of the colliding particles, the beam energy, the collision centrality and other parameters. Event simulation with realistic Monte-Carlo generators is also time-consuming. Processing the tens of millions of events may take significant time. Very large samplings must be produced by event generators to get reliable results, so any kind of performance-oriented improvement not only of the particles beam control [3], but of the simulation and reconstruction algorithms and their implementation is of the utmost importance.
A systematic approach to the performance-oriented optimization should take into account various aspects of the problem: 1) availability of a high-performance computing platform; 2) appropriate computational models; 3) the choice of efficient algorithms; 4) optimal software implementation; 5) usage of high-performance software libraries; 6) careful tuning of compiler optimizations; 7) optimization based on dynamic analysis of the application; 8) employing parallel programming techniques. The present study is devoted to the performance analysis and optimization of the algorithms used in the BM@N experiment of the NICA project and is based on some of the above mentioned aspects.

Performance bottlenecks of the BmnRoot software
The complexity of the BmnRoot package, the variety of the execution paths and their dependence on input parameters makes the dynamic performance analysis a necessity. For this purpose an instrumene-mail: sergey.merts@gmail.com e-mail: s.nemnyugin@spbu.ru  tation of source or binary files by functions that have access to hardware or system counters should be performed. In our study the performance analysis has been done using three approaches: • direct timing for some modules of the BmnRoot package which is implemented by insertion calls of standard timers in the source code; • usage of Google Performance Tools [4] for automatic localization of the most time consuming functions ("hotspots" of the program); • dynamic analysis of the BmnRoot modules by other software tools.
The results which have been obtained with all three approaches are consistent with each other. The analysis used the following testbenches and testcases. Testbench 1. CPU: Intel(R) Core(TM) i5-2400 @ 3.10GHz (4 core, no hyperthreading). RAM: 16 Gigabytes. OS: Linux (Ubuntu). Sampling size 5000 events for hotspot analysis (macros run_sim_bmn.C).
Testcase 3. Reconstruction for the LaQGSM generator. Sampling sizes: 5000 events for hotspot analysis and 4000 events for the study of scalability and quality assurance (collisions of Ar and Pb nuclei with energy 3.2 Gev/Nuclon, only the tracking in the inner detectors (Silicon + GEM) is included in the reconstruction, macros run_reco_bmn.C).
Some of the results of hotspot analysis are given in tables 1 and 2 (testbench 2 and testcases 2-3). It can be seen from Tab. 1 that the most time-consuming hotspots of the BmnRoot simulation part are system functions that may not be modified. As a consequence we have focused our attention on the track reconstruction modules [5].
One of the most significant hotspots of the BmnRoot package is the event reconstruction by Kalman filtering which is a de facto standard in particle trajectory reconstruction [6]. Other hotspots are the functions that deal with the magnetic field map. An advanced microarchitecture hotspot analysis has also revealed multiple inefficiencies in the code: data dependencies, inefficient use of pipelines and so on.

High-performance optimization of the BmnRoot
We have tested various gcc compiler optimization options for both the simulation and the reconstruction parts of the BmnRoot. The tests involved complex -O2 and -O3 level optimizations, aggressive vectorization, loops autoparallelization, profile-guided optimization etc. No significant effect was obtained, which is a consequence of the source code structure.
OpenMP parallelization was performed for the CellsConnection function. Implementation of the threadsafe parallelization required a modification of the algorithm used in the function. Its correctness was ensured by the Quality Assurance module. The scalability of the parallelized version is presented in Fig. 3. More efficient threadsafe parallelization of BmnRoot reconstruction modules requires a deeper modification of the reconstruction algorithm.

Conclusion
In this article we presented the results of systematic analysis of the BmnRoot software package with respect to the performance optimization. We have identified bottlenecks in both the simulation and the reconstruction modules of the BmnRoot software package. We have performed a partial parallelization of the reconstruction module, studied scalability of the parallelized version and observed up to 35 percent performance improvement. Further improvements of efficiency and scalability of the optimized BmnRoot modules require much deeper modification including a revision of the numerical algorithms being used. Hybrid programming for the General Purpose Graphics Processing Units and vectorization should also be analysed for their applicability to the BmnRoot package.