Improvements to the LHCb software performance testing infrastructure using message queues and big data technologies

Software is an essential component of High Energy Physics experiments. Because it is upgraded on relatively short timescales, software provides flexibility, but at the same time it is susceptible to issues introduced during the development process, which mandates systematic testing. We present recent improvements to LHCbPR, the framework implemented at LHCb to measure the physics and computational performance of complete applications. This infrastructure is essential for keeping track of the optimisation activities related to the upgrade of the computing systems, which is crucial to meet the requirements of the LHCb detector upgrade for the next stage of data taking of the LHC. The latest developments in LHCbPR include the application of a messaging system to trigger the tests right after the corresponding software version is built within the LHCb nightly builds infrastructure. We also report on the investigation of using big data technologies in LHCbPR. We have found that tools such as Apache Spark and the Hadoop Distributed File System may significantly improve the functionality of the framework, providing interactive exploration of the test results with efficient data filtering and flexible development of reports.


Introduction
The LHCb experiment will be upgraded, starting from 2019, for the next stage of data taking of the Large Hadron Collider (LHC) [1]. It is planned that the inelastic collision rate of 30 MHz will be processed by a full software trigger. An output bandwidth of up to 10 GB/s from the online trigger system to the offline computing system is expected to be required in order to fully exploit the capabilities of the LHCb detector. The current data storage capacity and computing power are not sufficient to process data at this rate. This challenges several areas of the LHCb software and computing infrastructure [2]. In particular, the infrastructure for software performance measurements is of special importance for keeping track of the optimisation activities related to the upgrade of the computing systems. The goal of such a framework is to automate the scheduling and execution of the tests, as well as to collect the metrics of interest and store them for future reference. A crucial feature of the system is that it ensures controlled conditions in order to obtain a reliable performance baseline.
As a result, one can easily inspect any changes in the software behaviour introduced in subsequent software versions. Furthermore, it is convenient to be able to compare the results of the tests across various compilers and architectures. It should also be emphasised that the project described in this work aims at providing tools to analyse not only resource consumption, such as CPU time or memory usage, but also to study the performance of the physics quantities derived from running physics applications. LHCbPR also addresses the common issue that software performance tests are run only by experts, because specific knowledge is required to set up, run and understand them. In addition, such a workflow is resource consuming, inefficient due to manual comparison, and not useful for the whole collaboration if the results are not publicly available. Another strength of the proposed solution is that, since LHCbPR tests run on dedicated machines, they can run much longer than unit tests (making them suitable for use cases which require processing a large number of events) and provide more information than a simple boolean value.
The LHCb Performance and Regression framework was introduced in [3] and [4]. This paper elaborates on the recent developments of this tool. In section 2 the design of the infrastructure and the current status of the system are described. Section 3 presents the prototype of the application of big data tools for LHCbPR. The most common use cases of LHCbPR are reported in section 4.

Infrastructure
The LHCbPR framework is organised following a microservice architecture. The main building components are: (i) a back-end providing an API service to retrieve the results of the tests from the database, (ii) a web front-end that allows users to browse, compare and plot test results, (iii) handlers responsible for parsing the output of the tests. The back-end and front-end modules run as docker [5] containers and are combined using docker-compose [6].
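To make the role of a handler more concrete, the following is a minimal sketch in python, not the actual LHCbPR handler code. It assumes a hypothetical log format in which timing lines look like `TIMER <name> <seconds> s`; the function names `parse_log` and `zip_metrics` and the archive member name `metrics.json` are likewise illustrative. The sketch parses the log into a dictionary of metrics and packs it into a zipped JSON file, which is the output format produced by the LHCbPR handlers.

```python
import io
import json
import re
import zipfile

# Hypothetical log line format, e.g. "TIMER EventLoop 12.5 s".
TIMING_LINE = re.compile(r"^TIMER\s+(?P<name>\S+)\s+(?P<seconds>\d+(?:\.\d+)?)\s*s$")


def parse_log(text):
    """Collect {metric name: seconds} from all lines matching TIMING_LINE."""
    metrics = {}
    for line in text.splitlines():
        match = TIMING_LINE.match(line.strip())
        if match:
            metrics[match.group("name")] = float(match.group("seconds"))
    return metrics


def zip_metrics(metrics, member="metrics.json"):
    """Serialise the metrics to JSON and return the bytes of a zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member, json.dumps(metrics, sort_keys=True))
    return buffer.getvalue()


# Illustrative test output; the numbers are made up.
log = """
INFO  Application started
TIMER EventLoop 12.5 s
TIMER Initialisation 3 s
INFO  Application finished
"""

archive_bytes = zip_metrics(parse_log(log))
```

Keeping the parsing step separate from the packaging step means that a new test only needs a new parsing function, while the output format stays the same for every handler.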
The workflow in the LHCbPR system is presented in figure 1. The tests are triggered by user requests and by messages coming from the LHCb continuous integration system [7,8]. In the former case, the user specifies the application, its version, the option file stored in the dedicated repository and the handler module. A message with this information is sent to the queue of tests, which is periodically consumed by one of the jobs (implemented in python [9] and bash [10]) in the LHCb instance of the Jenkins [11] infrastructure. In the latter case, once an application for a given platform (defined by the compiler, architecture and operating system) is compiled, a message is sent to the queue of builds, which is checked against the list of tests scheduled in an XML file. The message queue is implemented using the RabbitMQ message broker [12]. Such a solution makes efficient use of the CPU time of the build and test machines, as the test jobs are scheduled to run on dedicated nodes as soon as the corresponding software is built. In particular, benchmarks measuring resource consumption are run on machines where unwanted load is minimised. As the next step of the LHCbPR workflow, the output of the tests is parsed by the handlers. These python modules produce zipped JSON [13] files with the metrics of interest and, optionally, ROOT [14] files. The zip files are then uploaded to a Dirac Storage Element [15] and imported into a MySQL [16] database using the back-end, which is implemented with the Django framework [17]. The outcome is available in an AngularJS [18]-based dashboard. In the meantime, notifications about new results are sent to a Mattermost [19] channel for interested users. The system allows a threshold value to be defined for each monitored metric, which may trigger an alarm when exceeded. However, due to the high number of tests and the rapidly evolving software, this feature proved too difficult to maintain and is not commonly used.

Enhancement with big data technologies
Numerous technologies have emerged to support the analysis of big data, understood in terms of its volume, variety and velocity. Such tools are especially useful for data exploration and for creating data warehouses. One of the distinctive examples is the Hadoop ecosystem [20]. The volume of the LHCbPR data cannot be considered large-scale yet; however, its variety makes Hadoop an interesting framework for the analysis of the results of the LHCb software tests. In particular, it makes interactive analysis of the data possible and is therefore more flexible compared with the current approach based on a statically generated dashboard. Hence, access to the data is easier and faster, and a short turn-around is desirable for data analysts. Moreover, combining these tools with shared notebooks makes such a system well suited for collaboration and reproducibility.
The diagram presenting the prototype for the integration of the LHCbPR framework with the Hadoop ecosystem is shown in figure 2. It is based on the Hadoop service provided by the CERN IT group [21]. In this setup the results of the tests are stored on the EOS storage [22] in addition to the LHCbPR database. Dedicated scripts merge the JSON files daily and copy them into the Hadoop Distributed File System [23] (HDFS). Afterwards the data are converted into the Apache Parquet [24] format, partitioned by application name and option file, which significantly reduces the time needed for data filtering. A compression factor of the order of 16 with respect to the JSON files was found. The Parquet format was chosen since it provides fast data ingestion, random data access and scalability [25]. For data analysis, the Apache Spark [26] processing engine is used through the integration with notebooks, in particular Apache Zeppelin [27] and SWAN [28]. Some of the LHCbPR tests produce ROOT files as well as JSON files. To read them, the spark-root [29] package was investigated and found to be useful. It is also planned to investigate the WebHDFS protocol [30] to read LHCbPR data in the web dashboard, in parallel to the approach based on HDFS and notebooks. It should be pointed out that it may not be possible to transform the analysis modules of the existing web front-end into SWAN notebooks while preserving the same usability and functionality. Therefore, our goal is to support both methods of data analysis in LHCbPR. A typical analysis in SWAN is based on the PySpark [31] module and also makes use of the matplotlib library [32]. Some interactivity can also be introduced by applying the widgets provided by Jupyter notebooks [33].
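The benefit of partitioning by application name and option file can be illustrated with a small sketch that uses only the python standard library. It is not the prototype itself: plain JSON files stand in for Parquet, a temporary local directory stands in for HDFS, and the record fields and function names are assumptions. What it does reproduce is the `column=value` directory layout used for partitioned datasets, so that a query for one application and one option file only touches the files in a single directory.

```python
import json
import tempfile
from pathlib import Path


def write_partitioned(records, root):
    """Write each record under application=<app>/options=<opt>/.

    The partition values are assumed to be safe to use as directory names.
    """
    for index, record in enumerate(records):
        partition = (
            Path(root)
            / f"application={record['application']}"
            / f"options={record['options']}"
        )
        partition.mkdir(parents=True, exist_ok=True)
        (partition / f"part-{index:05d}.json").write_text(json.dumps(record))


def read_partition(root, application, options):
    """Read only the files of the requested partition, skipping all others."""
    partition = Path(root) / f"application={application}" / f"options={options}"
    return [json.loads(path.read_text()) for path in sorted(partition.glob("*.json"))]


# Illustrative records; names and values are made up.
records = [
    {"application": "Gauss", "options": "bench", "version": "v1", "cpu_time": 100.0},
    {"application": "Gauss", "options": "bench", "version": "v2", "cpu_time": 95.0},
    {"application": "Brunel", "options": "throughput", "version": "v1", "cpu_time": 40.0},
]

with tempfile.TemporaryDirectory() as tmp:
    write_partitioned(records, tmp)
    gauss_bench = read_partition(tmp, "Gauss", "bench")
    n_partitions = len(list(Path(tmp).glob("application=*/options=*")))
```

In Spark the same pruning happens automatically when a filter is applied to the partition columns of a partitioned Parquet dataset, so the analysis code does not need to build the paths by hand.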

Use cases
The flagship analysis performed in the LHCbPR framework is to plot given measurements as a function of the software version. Memory usage of the application, CPU time spent by a given algorithm, reconstruction efficiency, and throughput (number of events processed per second by the reconstruction code) are the most common examples of the monitored metrics. Figure 3 presents the CPU time spent in the event loop by a benchmark job, which is a simulation of inclusive b-events in proton-proton collisions at 13 TeV performed using the Gauss simulation application [34]. Labels on the x-axis correspond to different versions of Gauss. The decrease in time for versions 4 and 10 is due to changes in the Ring Imaging Cherenkov (RICH) detector [35] code, made specifically to speed it up, which were introduced in the corresponding software versions. This indicates the power of the LHCbPR framework, as one can observe in an automated way any variations in the code output introduced by e.g. GitLab merge requests [36], new external libraries or Monte Carlo generators.
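The kind of version-to-version comparison described above can be sketched in a few lines of python. The values below are invented for illustration and are not the measurements shown in figure 3; the function names and the 5% tolerance are also assumptions. The sketch computes the relative change of a metric with respect to the previous version and reports the versions where the change exceeds the tolerance.

```python
def relative_changes(series):
    """Return (version, relative change with respect to the previous version) pairs.

    `series` is a list of (version, value) pairs ordered by version.
    """
    changes = []
    for (_, previous), (version, value) in zip(series, series[1:]):
        changes.append((version, (value - previous) / previous))
    return changes


def flag_changes(series, tolerance=0.05):
    """Keep only the versions whose relative change exceeds the tolerance."""
    return [(v, c) for v, c in relative_changes(series) if abs(c) > tolerance]


# Invented CPU times (arbitrary units) for five consecutive versions.
cpu_time = [("v1", 100.0), ("v2", 101.0), ("v3", 99.0), ("v4", 80.0), ("v5", 81.0)]

flagged = flag_changes(cpu_time)
```

With these invented numbers only `v4` is reported, since the metric drops by about 19% there while all other steps stay within 2%. Comparing against the previous version rather than a fixed threshold avoids having to maintain a threshold value per metric.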
Other use cases of the LHCbPR framework include rate and throughput tests for the High Level Trigger [37], and simulation validation [38]. Furthermore, there are also automated tests employing code profiling tools such as perf [39] and IgProf [40], which make it possible to analyse where the program spends most of its time and to investigate memory leaks.

Summary
Monitoring of the software is an essential tool in large scientific projects such as LHCb.
The LHCbPR system has already proved to be a versatile framework useful for the whole collaboration, improving the monitoring and control of the software developed for the LHC upgrade era. The LHCbPR project is not coupled to the LHCb software stack, which makes it suitable for use by other projects dealing with a large and rapidly evolving code base. In addition, this paper indicates that big data tools appear to be promising for integration with LHCbPR, since they make it possible to efficiently create customisable reports on the test results, developed in shared notebooks, which makes them reproducible and convenient for collaboration.

Figure 3. CPU time spent by the simulation application as a function of the software version.