The Controls and Configuration Software of the ATLAS Data Acquisition System: evolution towards LHC Run 3

The ATLAS experiment at the Large Hadron Collider (LHC) operated very successfully in the years 2008 to 2018, in two periods identified as Run 1 and Run 2. ATLAS achieved an overall data-taking efficiency of 94%, largely constrained by the irreducible dead-time introduced to accommodate the limitations of the detector read-out electronics. Out of the 6% dead-time, only about 15% could be attributed to the central trigger and DAQ system, and of that, a negligible fraction was due to the Control and Configuration subsystem. Despite these achievements, and in order to further improve the already excellent efficiency of the whole DAQ system in the coming Run 3, a new campaign of software updates was launched for the second long LHC shutdown (LS2). This paper presents, using a few selected examples, how the work was approached and which new technologies were introduced into the ATLAS Control and Configuration software. Although these solutions are specific to this system, many of them can be considered and adapted to other distributed DAQ systems.


Introduction
The ATLAS experiment [1] at the Large Hadron Collider at CERN relies on a complex Trigger and Data Acquisition (TDAQ) system [2] to gather and select particle collision data at unprecedented energy and rate.
The TDAQ system is a highly complex distributed computing system composed of a large number of hardware and software components (about 3000 computers and more than 50000 concurrent processes) which, in a coordinated manner, provide the data-taking functionality of the overall system.
The Control and Configuration (CC) system [3] is responsible for the software infrastructure which manages and operates the various components of the system and integrates them with the wider ATLAS data taking environment.
It is the software component taking care of configuring, controlling and monitoring all the TDAQ components, and it has to guarantee the smooth and synchronous operation of all the various sub-systems. The CC system is a crucial actor in operating the TDAQ system. Indeed, a disruption in any of the basic and fundamental services it provides would not only prevent the acquisition of LHC collision data but would also undermine the capability to properly control and monitor a data-taking session. For that reason, the CC system is required to provide the means to minimize the downtime of the system caused by runtime failures.

CC Software evolution and upgrade towards LHC Run 3
Control and Configuration services and applications played an important role in the successful TDAQ operations during the data-taking periods of LHC Runs 1 and 2, enabling a high level of data-taking efficiency.
The first long LHC shutdown (LS1, from February 2013 to spring 2015) was primarily used to carry out a complete revision of the control and configuration software. Indeed, several packages had been designed in the late '90s and developed over the following decade. Additionally, new requirements, not foreseen when the system was originally designed, emerged during ATLAS operations and were implemented in a less than optimal way. At the same time, new software technologies appeared that could easily replace or simplify several custom-made solutions.
As a consequence, the goals for the LS1 updates were three-fold: properly accommodate additional requirements that could not be seamlessly included during steady operation of the system; re-factor software that had been repeatedly modified to include new features, thus becoming less maintainable; and seize the opportunity of modernizing the software, thus profiting from the rapid evolution in IT technologies. The LS1 CC updates are discussed in detail in [4].
Based on what was done during LS1 and the additional experience gained during Run 2, a new campaign of software updates was launched for the second long LHC shutdown (LS2).
LS2 provided a further opportunity to significantly improve various parts of the system and to make use of new technologies and standards.
All the upgrades were carried out under the important constraint of minimally impacting the system's mode of operation and public APIs, in order to maximize the acceptance of the changes by the large user community.

Developments for Run 3
The following sections describe the strategic choices and approaches that have been adopted for the development of some of the major components of the CC system.

Configuration service storage improvements
The service provides the data taking configuration parameters. Such data describe more than 50,000 online processes distributed on more than 3,000 nodes with details of their control, monitoring, diagnostic, recovery, data-flow and data quality configurations, as well as connectivity and parameters for various TDAQ and detector modules, chips and channels. The configurations are prepared and updated by many experts from various DAQ, high-level trigger and detector groups [5]. Their consistency is one of the key requirements for reliable and effective functionality of the TDAQ system and the whole ATLAS experiment.
The configuration service is based on the OKS database, which stores data in XML files; it was developed in the mid '90s [6] and gradually improved following the needs of the experiment. During LHC Run 2 we performed an evaluation of available database technologies, looking for suitable candidates for configuration data storage and distribution. It was decided to keep the OKS database as the implementation, with a few changes.

Data format changes
The first change was an improvement of the OKS XML data format to make it slimmer and more readable by humans. In particular, the data payload is now stored in XML attributes instead of elements, and empty values and internal counting numbers are avoided. Thus, an expert updating an XML file in a text editor is much less likely to introduce a syntax mistake.
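As an illustration, the schematic fragment below shows the same payload expressed first with nested elements and an internal counter, and then in the slimmer attribute-based form; it does not reproduce the exact OKS schema, and the class, attribute and value names are invented:

  <!-- Before: payload wrapped in child elements, with an internal counter (schematic) -->
  <obj class="Application" id="ExampleApp">
    <attr name="Timeout" type="u32" num="1">
      <data val="30"/>
    </attr>
  </obj>

  <!-- After: payload carried directly by attributes; empty values and counters are dropped -->
  <obj class="Application" id="ExampleApp">
    <attr name="Timeout" type="u32" val="30"/>
  </obj>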

New repository storage: Git backend
The configuration is stored in more than 1000 XML files, with various experts responsible for updating the ones associated with their systems and detectors. In 2008, a special service was implemented to validate any changes before they are committed into the OKS repository, verifying the consistency of the files and the update permissions based on user roles [7]. The service was successfully used during LHC Runs 1 and 2, keeping the configuration data consistent 100% of the time; however, some changes were needed for the future.
The service was based on CVS [8], which was also used for the development of the ATLAS software at the time. CVS has not been supported for many years and lacks important security and interface improvements. Nowadays, Git [9] is an obvious replacement for the service implementation, offering many benefits and profiting from the expertise and commitment of our software experts. Keeping the service design mostly unchanged, we avoided several issues of the CVS implementation:
• Since Git uses transactions, several changes in interconnected files can now be committed in one go. This was not the case in the CVS implementation, where every file was committed individually and a failure might leave the repository in an inconsistent state.
• In the CVS implementation a configuration was read from a repository snapshot on a shared file system, updated by the CVS server after every commit. Thus, processes of the same run might get different configurations depending on the moment they accessed it. In the new implementation a run configuration is pinned by a unique Git commit hash, so any process reads the same configuration independently of ongoing commits.
• The old service implemented the repository validation on the client side and provided a special utility to commit changes on the CVS server, so the CVS interface was not exposed to the users. In the Git implementation the validation is performed in the server-side Git pre-receive hook, so clients can use any Git interface, including web editors.
• When a configuration needs to be reloaded in the course of a data-taking session, the CVS implementation presented the operator with a list of modified files to be selected for the new configuration. This error-prone approach was replaced by the selection of a configuration from the newly available revisions with meaningful commit logs. Postponed changes are committed into Git branches, and Git merge requests are used to handle them.
The OKS Git update workflow is presented in Figure 2. Gitea [10] is used as the Git server and is only accessible inside the ATLAS experiment area. To modify a configuration, the user clones the repository, makes the necessary changes, commits them and pushes them back to Gitea, as sketched in the example below. The Gitea hook validates the changes and, in case of success, stores them on the server, updates the read-only snapshot on NFS [11] and synchronizes with the CERN GitLab server [12]. The NFS snapshot is only used for fast viewing of the configuration. The copy of the repository on CERN GitLab can be used to read the configuration outside of the experiment area and is accessible world-wide for authorized ATLAS users.
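The following is a minimal sketch of this user-side workflow using the JGit Java library; the repository URL, file names and commit message are purely illustrative (experts can equally use the plain Git command line or a web editor), and credential handling is omitted.

  import java.io.File;
  import org.eclipse.jgit.api.Git;
  import org.eclipse.jgit.revwalk.RevCommit;

  public class OksUpdateSketch {
      public static void main(String[] args) throws Exception {
          // Clone the configuration repository from the Gitea server (illustrative URL and path).
          try (Git git = Git.cloneRepository()
                  .setURI("https://gitea.example.cern.ch/atlas/oks.git")
                  .setDirectory(new File("/tmp/oks-clone"))
                  .call()) {

              // ... edit several interconnected XML files in the working tree ...

              // Stage and commit the related changes together, in a single transaction.
              git.add().addFilepattern("daq/segments/ros.data.xml").call();
              git.add().addFilepattern("daq/hw/hosts.data.xml").call();
              RevCommit commit = git.commit()
                      .setMessage("Add a new ROS host and its read-out segment")
                      .call();

              // The commit hash uniquely pins this configuration for a whole run.
              System.out.println("Configuration revision: " + commit.getName());

              // Push back to Gitea; the server-side pre-receive hook validates the file
              // consistency and the user's permissions before accepting the update.
              git.push().call();
          }
      }
  }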

Archival
The last change is the new configuration archiving approach. In the past, every OKS configuration used for an ATLAS data-taking run was archived in Oracle with fine-grained details of the individual attribute values of the configuration objects. Thus, every new value of every attribute resulted in one or more new rows in the Oracle tables, to be accessible independently of the others. This was requested by the configuration task force before Run 1, but has never been used in practice since then. Instead, the archived configurations were only accessed at file granularity. It was then decided to drop the Oracle archiving and to use the CERN GitLab repository instead. Every OKS configuration revision can be accessed world-wide via the secure HTTPS protocol using the CERN Single Sign-On (SSO). Any configuration used for a data-taking run is tagged in Git with the run number, and the tag is stored in the Run Number database. Thus, every archived OKS configuration can be easily accessed by its run number, as sketched below.
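As an illustration, the sketch below uses the JGit library to retrieve one file of an archived configuration from a local clone of the repository, resolving a tag named after the run number. The tag naming convention, the file path and the run number are invented for the example and do not necessarily match the actual repository layout; the file is assumed to exist at that revision.

  import java.io.File;
  import org.eclipse.jgit.lib.ObjectId;
  import org.eclipse.jgit.lib.Repository;
  import org.eclipse.jgit.revwalk.RevCommit;
  import org.eclipse.jgit.revwalk.RevWalk;
  import org.eclipse.jgit.storage.file.FileRepositoryBuilder;
  import org.eclipse.jgit.treewalk.TreeWalk;

  public class ArchivedConfigSketch {
      public static void main(String[] args) throws Exception {
          // Open a local clone of the archived configuration repository (illustrative path).
          try (Repository repo = new FileRepositoryBuilder()
                  .setGitDir(new File("/tmp/oks-archive/.git")).build();
               RevWalk walk = new RevWalk(repo)) {

              // Resolve the tag carrying the run number to the commit it points to.
              ObjectId id = repo.resolve("refs/tags/r364098^{commit}");
              RevCommit commit = walk.parseCommit(id);

              // Read one configuration file exactly as it was used for that run.
              try (TreeWalk tree = TreeWalk.forPath(repo, "combined/ATLAS.data.xml",
                                                    commit.getTree())) {
                  byte[] content = repo.open(tree.getObjectId(0)).getBytes();
                  System.out.println("Retrieved " + content.length + " bytes for run 364098");
              }
          }
      }
  }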

CHIP: expert system
CHIP (Central Hint and Information Processor) [13] was introduced into the TDAQ system for the first time during the LS1 period. CHIP is an automation and error-management service whose task is two-fold: maximizing the system efficiency and optimizing the use of person power. Indeed, through a continuous and comprehensive monitoring of the TDAQ system, CHIP proved in Run 2 to be able to deal quickly and effectively with errors and failures, thus greatly reducing the need for manual interventions by the operator. CHIP's tasks can be divided into three main categories:
• Handling abnormal conditions;
• Automating complex procedures;
• Performing advanced recoveries.
CHIP at its core relies on ESPER [15], an open-source Java-based Complex Event Processing (CEP) [14] engine. The key features of the ESPER engine are the support for advanced stream analysis (correlation, aggregation, sliding windows, temporal patterns), a rich SQL-like Event Processing Language (EPL) to express the knowledge base, a natively highly configurable multi-threaded architecture, the support for historical data replication and built-in advanced metrics. During Run 2, the performance of a single instance of CHIP was adequate to monitor the whole TDAQ system and handle all the streams of information generated during the data-taking sessions (Figure 3), with data injection peaks into the ESPER engine of about 40 kHz. The evaluation of EPL statements was very efficient too, with an average execution time of about 2 µs per statement (weighted by the number of executions of a statement).
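To give a flavour of how such a knowledge base is expressed, the sketch below compiles and deploys a single EPL statement with the Esper 8 API; the event type, statement and threshold are invented for illustration and are not taken from the actual CHIP knowledge base.

  import com.espertech.esper.common.client.EPCompiled;
  import com.espertech.esper.common.client.configuration.Configuration;
  import com.espertech.esper.compiler.client.CompilerArguments;
  import com.espertech.esper.compiler.client.EPCompilerProvider;
  import com.espertech.esper.runtime.client.EPDeployment;
  import com.espertech.esper.runtime.client.EPRuntime;
  import com.espertech.esper.runtime.client.EPRuntimeProvider;

  public class EplSketch {
      // Hypothetical event type: an application error reported to the engine.
      public static class AppError {
          private final String app;
          public AppError(String app) { this.app = app; }
          public String getApp() { return app; }
      }

      public static void main(String[] args) throws Exception {
          Configuration cfg = new Configuration();
          cfg.getCommon().addEventType(AppError.class);           // register the event type

          // Invented EPL: report applications producing more than 5 errors in a sliding 30 s window.
          String epl = "@name('noisy-apps') select app, count(*) as cnt "
                     + "from AppError#time(30 sec) group by app having count(*) > 5";

          // Esper 8: statements are compiled to Java byte code before deployment.
          EPCompiled compiled = EPCompilerProvider.getCompiler()
                  .compile(epl, new CompilerArguments(cfg));

          EPRuntime runtime = EPRuntimeProvider.getDefaultRuntime(cfg);
          EPDeployment deployment = runtime.getDeploymentService().deploy(compiled);

          // React whenever the statement produces output (e.g. to trigger a recovery action).
          deployment.getStatements()[0].addListener((newData, oldData, stmt, rt) ->
                  System.out.println(newData[0].get("app") + " is noisy: " + newData[0].get("cnt")));

          // Inject events into the engine (in CHIP these come from the monitoring infrastructure).
          for (int i = 0; i < 7; i++) {
              runtime.getEventService().sendEventBean(new AppError("exampleApp"), "AppError");
          }
      }
  }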
For the LS2 updates, the knowledge base was extended even further to cover new scenarios and implement stop-less operations for new components of the TDAQ system (i.e., the new read-out [16]). Currently the knowledge base counts more than 340 EPL statements in 29 different contexts, corresponding to a 13% increase in the number of statements with respect to Run 2. Furthermore, LS2 was a good occasion to update the underlying ESPER engine to its latest version, 8. Even though the migration required some code modification and adaptation in CHIP, it brought appreciable improvements in terms of performance and knowledge base organization. Indeed, the EPL statements are now compiled into Java byte code, hence improving the efficiency of the runtime execution, allowing a more fine-grained verification process and exposing features usually only available in high-level object-oriented programming languages (e.g., the possibility to define streams of events as private in the context of a knowledge base module, thus greatly reducing the probability of hiding or overriding a certain data stream definition). As an example, the average execution time (wall time) for the most executed EPL statement could be reduced by more than 40%.

Figure 3. The average (blue line) and maximum (orange line) event injection rate into the ESPER engine over the whole 2018 ATLAS data-taking period. As expected, the maximum injection rate has a high degree of variation. Indeed, the number of events generated by the DAQ system greatly depends on the state of the system itself (i.e., state transitions, errors and anomalies). At the same time, the average rate is rather constant, with dips corresponding to the LHC machine development and maintenance periods.

Web applications and new requirements
In the last decades, web applications have been used more and more widely in the area of control and monitoring of the ATLAS experiment, gradually replacing traditional GUI applications (e.g. Qt-based). During LS2 we decided to develop a web application providing functionality similar to that of the main controlling application for TDAQ data taking, the Java-based Integrated GUI (IGUI) [17]. This functionality must include: connecting to a running TDAQ session for control or monitoring; presenting a hierarchical tree of the different types of TDAQ applications, dynamically updating changes in their states; sending commands to controller applications; subscribing to and browsing Error Reporting Service (ERS) messages in real time; and less critical features such as browsing of application log files. The application was named Web Run Control or WebRC.

Choice of technology: Apache Wicket
The main factors affecting the choice of technology for the WebRC application were the following:
• Its backend part needs to be tightly integrated with the main TDAQ services like Run Control, the Information Service and ERS;
• The frontend part should offer a rich set of widgets, allowing the implementation of features similar to Java Swing elements;
• It shall scale well and be conservative in resource usage, allowing connections from many users and serving multiple TDAQ partitions in parallel;
• It shall support dynamic and interactive web features like Ajax or WebSockets.
Apache Wicket, a component-oriented, Java-based server-side web framework, was selected as it satisfies these requirements.

ELisA: the electronic logbook
ELisA is the electronic logbook of the ATLAS experiment, used by the operators, experts and automated services to record and share information. The logbook comprises a web user interface (as seen in Figure 5), a REST API, a set of client libraries, and a set of command line utilities for programming-free access to the logbook operations [20]. The logbook uses a database backend to store its configuration. The facility provides a configurable email notification mechanism. Also, a logbook message can be replied to directly from the user's preferred mail client, without accessing the web interface. The logbook implements restricted access with multiple user authentication mechanisms. Developed primarily as the ATLAS experiment operations logbook, the ELisA facility is currently being used for other projects such as detector development and commissioning work. During the LS2 period, we have mainly implemented solutions [21] to improve the portability of the logbook for private setups outside of the ATLAS working environment:
• We have added support for other database technologies (MySQL) in order to avoid the dependency on the centralized Oracle database, which is quite complex to set up;
• We have extended the authentication mechanisms: besides the existing choices based on the CERN Single Sign-On (SSO) and LDAP servers, social media login (Google and GitHub) is now available. This could potentially be useful for setups outside the CERN environment;
• We have reduced the development and maintenance effort by using the Spring Boot framework (a minimal sketch is shown after this list). This framework eases dependency management, XML configuration is ruled out almost entirely, and an application is automatically configured based on the project dependencies. In addition, Spring Boot provides an embedded servlet container, thus eliminating the need for installing and maintaining an external servlet container;
• We have improved the deployment procedure by packaging all the necessary software and configuration into an RPM, which is available in a public repository.
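Below is a minimal sketch of how such a Spring Boot application is bootstrapped; the class and endpoint names are invented and do not reflect the actual ELisA code base. The annotations drive the auto-configuration and the embedded servlet container, so no external XML configuration or standalone application server is needed.

  import java.util.Map;
  import org.springframework.boot.SpringApplication;
  import org.springframework.boot.autoconfigure.SpringBootApplication;
  import org.springframework.web.bind.annotation.GetMapping;
  import org.springframework.web.bind.annotation.PathVariable;
  import org.springframework.web.bind.annotation.RestController;

  // Enables component scanning and auto-configuration based on the project dependencies.
  @SpringBootApplication
  public class LogbookApplication {
      public static void main(String[] args) {
          // Starts the application on the embedded servlet container.
          SpringApplication.run(LogbookApplication.class, args);
      }
  }

  // A hypothetical REST endpoint, auto-detected by the component scan.
  @RestController
  class MessageController {
      @GetMapping("/api/messages/{id}")
      public Map<String, Object> message(@PathVariable long id) {
          // Placeholder payload; a real implementation would query the database backend.
          return Map.of("id", id, "subject", "example message");
      }
  }

With this approach the whole web application, including the servlet container, starts from a single executable jar, which simplifies packaging it into the RPM mentioned above.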

Conclusions and Outlook
The Control and Configuration software has contributed to the physics results obtained by the ATLAS experiment during Run 1 by ensuring smooth and efficient data taking. It was completely revised during the Long Shutdown 1 (2013-2015) period in order to accommodate additional requirements, improve maintainability and profit from advances in IT technologies. All this was done applying minimal changes to the APIs, so that the large amount of client code would not need significant adaptations. The Control and Configuration software proved to be stable, reliable and well performing in LHC Run 2 (2015-2018). In order to face the new challenges that will arise in Run 3 operations, the Control and Configuration software has undergone a further modernization of several of its components during the Long Shutdown 2. The experience of operating the TDAQ system has also demonstrated that the overall modular architecture of the control and configuration system is flexible and supports partial upgrades, as well as a step-wise modernization of its components. This is fundamental for a system that is foreseen to run for several more years and that will undergo several more upgrade iterations.