DAQExpert - the service to increase CMS data-taking efficiency

The Data Acquisition (DAQ) system of the Compact Muon Solenoid (CMS) experiment at the LHC is a complex system responsible for the data readout, event building and recording of accepted events. Its proper functioning plays a critical role in the data-taking efficiency of the CMS


Introduction
The Compact Muon Solenoid (CMS) [1] Data Acquisition (DAQ) system is responsible for reading out the data from one of the two general-purpose experiments at the Large Hadron Collider (LHC). The accelerator complex provides proton-proton bunch crossings at a rate of 40 MHz, and the average size of each collision event is 1-2 MB. A two-level trigger is in place to select only the most interesting data for storage and further analysis. At the first level, a hardware trigger selects events at a rate of 100 kHz. Full events are read out and built from all detector electronics, yielding a throughput of 200 GB/s. At the second level, the High Level Trigger farm of 35 000 cores reduces the event rate to O(1 kHz).
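The rates quoted above can be cross-checked with a short calculation. This is only a sketch: it assumes the upper end of the quoted event size (2 MB), decimal units (1 GB = 10^9 bytes) and exactly 1 kHz for the O(1 kHz) output rate. The constant and function names are illustrative and not part of the DAQ software.

```python
# Figures taken from the text; the names are illustrative only.
BUNCH_CROSSING_HZ = 40_000_000   # 40 MHz bunch-crossing rate
L1_RATE_HZ = 100_000             # first-level trigger accept rate, 100 kHz
HLT_OUTPUT_HZ = 1_000            # assumption: O(1 kHz) taken as exactly 1 kHz
EVENT_SIZE_BYTES = 2_000_000     # assumption: upper end of the 1-2 MB range


def throughput_gb_per_s(rate_hz: float, event_size_bytes: float) -> float:
    """Event-building throughput in GB/s (1 GB = 1e9 bytes)."""
    return rate_hz * event_size_bytes / 1e9


# Rate reduction factor achieved by each trigger level.
l1_reduction = BUNCH_CROSSING_HZ / L1_RATE_HZ
hlt_reduction = L1_RATE_HZ / HLT_OUTPUT_HZ

print(throughput_gb_per_s(L1_RATE_HZ, EVENT_SIZE_BYTES), "GB/s")
print("L1 reduction factor:", l1_reduction)
print("HLT reduction factor:", hlt_reduction)
```

With these assumptions the event builder has to sustain 200 GB/s, which matches the quoted throughput.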
Due to the complexity of the system, issues related to hardware, software and networking cannot be excluded. The proper functioning of all components of the system is required for reliable data taking, otherwise the dataflow may be stuck or degraded. In order to minimize the downtime of the system, various recovery procedures have been prepared by the system experts. The operator crew, rotating in the control room 24/7, supervises the data taking and follows recovery procedures if needed. Support from on-call experts is available around the clock.
There is a human factor involved in this scheme of operations. We have observed that operators may make mistakes under time pressure and that their reactions add latency to the total intervention time. During LHC Run-1 and LHC Run-2, various automation mechanisms were introduced into the system [2,3].

DAQExpert
DAQExpert [4] is a service to identify and mitigate dataflow issues in order to improve the efficiency of the experiment. It provides guidance to operators and enables system experts to define the steps required to resolve operational issues in a timely manner. Additionally, it provides them with tools to perform post-mortem analysis. It constantly analyses the monitoring data from the Run Control system [5][6][7][8], and based on procedures defined by experts, finds the optimal way to recover when dataflow is stuck.

Scope
The data-taking efficiency of CMS was 95.87% in 2018. It is measured as the percentage of system uptime during the total time of Stable Beams delivered by the LHC. There were 2184 hours of Stable Beams in total in 2018. Due to various issues with power supplies, other infrastructure, the LHC and the DAQ system, 90 hours were not recorded. Forty-six hours of this downtime were assigned to DAQ-related issues (93% subdetector problems, 7% central DAQ). DAQExpert aims to reduce this DAQ downtime, while the remaining 44 hours of downtime are outside of its influence.
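The efficiency definition above can be reproduced from the quoted hours. The sketch below uses only the numbers given in the text, and the variable names are illustrative. With the rounded hours the result is about 95.88%, marginally above the quoted 95.87%; the small difference presumably comes from rounding of the hour counts, which is an assumption rather than something stated in the source.

```python
# 2018 figures quoted in the text; the names are illustrative only.
STABLE_BEAMS_H = 2184    # total Stable Beams time
LOST_H = 90              # hours not recorded
DAQ_LOST_H = 46          # part of the lost time assigned to DAQ-related issues

# Efficiency = uptime / Stable Beams time.
efficiency = 1 - LOST_H / STABLE_BEAMS_H

# Downtime outside the influence of DAQExpert.
non_daq_lost_h = LOST_H - DAQ_LOST_H

# Split of the DAQ-related downtime using the quoted 93% / 7% shares.
subdetector_h = 0.93 * DAQ_LOST_H
central_daq_h = 0.07 * DAQ_LOST_H

print(f"efficiency: {efficiency:.2%}")
print(f"non-DAQ downtime: {non_daq_lost_h} h")
print(f"subdetector: {subdetector_h:.1f} h, central DAQ: {central_daq_h:.1f} h")
```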

Impact
The intervention time (a.k.a. mean time to repair, MTTR) is a key metric to measure the reliability of services, including the DAQ system. The intervention time consists of the reaction time of the operator and the recovery time itself, which is subject to constant change. Many factors influence the recovery time, namely subsystem development, improvements in the run control system and special running conditions. The MTTR is therefore an inappropriate metric to measure the impact of DAQExpert. The reaction time, on the other hand, depends only on the individual abilities and alertness of operators, which are expected to remain stable over the period of investigation. Moreover, the reaction time constitutes a considerable part of the total intervention time. It is therefore an adequate metric to measure the impact of DAQExpert. DAQExpert aims to reduce the operator's reaction time and to avoid human errors that are likely under time pressure. It was introduced gradually during Run-2. The service was first made available at the beginning of 2017, and has since been enhanced with user experience improvements and extended expert input. We observed the mean reaction time decrease from 101 seconds in 2016 to 65 seconds in 2017 and 47 seconds in 2018. This improvement can be attributed to the DAQExpert guidance.
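The size of the improvement in mean reaction time can be expressed as a relative reduction with respect to 2016. The following sketch uses only the three yearly means quoted above; the helper name is illustrative.

```python
# Mean operator reaction time per year, in seconds (from the text).
MEAN_REACTION_S = {2016: 101, 2017: 65, 2018: 47}


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


baseline = MEAN_REACTION_S[2016]
for year in (2017, 2018):
    reduction = relative_reduction(baseline, MEAN_REACTION_S[year])
    print(f"2016 -> {year}: {reduction:.0%} shorter mean reaction time")
```

Relative to 2016, the mean reaction time is roughly one third shorter in 2017 and roughly one half shorter in 2018.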

The Human Factor
The human factor is introduced whenever the operator is allowed to make a final decision on the recovery procedure. The breakdown of the reaction time in Table 1 reveals a stable trend over the investigated period. Overall operator reaction time was reduced between 2016 and 2018 by 59% for the fastest 95% and 75% of reactions, 66% for the fastest 50% of reactions and 54% for the fastest 25% of reactions. There is little improvement in 2018 in the fastest 25% of operator reactions, which improved from 23 to 21 seconds. Figure 1 shows that the vast majority of reaction times were in the range of 20-25 seconds, with a significant positive tail. Additionally, it is clear that further reduction below 10 seconds is not feasible with human operators involved in the process. The most impactful way to improve the DAQ operations is therefore to bypass the operator wherever possible.

Automatic Recovery
Automatic recovery is a functionality of the DAQExpert system. It was introduced to avoid the latency of human actions and the risk of wrong decisions. It became available at the end of Run-2, and the first recoveries were successfully carried out without operator involvement.

Architecture
DAQExpert adopts a microservices architecture and consists of multiple services (see Figure 3) that provide guidance, enable post-mortem analysis and conduct automatic recoveries. The snapshot service collects all relevant data from the monitoring data sources and persists them for further use. The reasoning service consumes the monitoring data in real time in order to identify potential data-taking problems. It allows DAQ system experts to encode their domain knowledge in the form of Logic Modules [4,9]. The notification service dispatches notifications to the system experts and operators. The latest addition, the controller, is in charge of performing the recovery procedure, which so far was the responsibility of the operators. It sends the commands to the CMS Run Control system and keeps the operator crew updated on its actions. Web clients provide an interface for operators, with timely guidance (see Figure 4). Other user interfaces are dedicated to system experts and enable post-mortem analysis. The project adopts Web technologies: ReactJS [10] and Bootstrap [11] for building the reactive user interfaces; Java [12], the Spring framework [13], websocket [14] and servlet [15] for building the backend services with RESTful APIs delivering live and on-demand data; Oracle database [16], Hibernate [17] and JDBC [18] for persistence; and Apache Tomcat [19] for serving the services.
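The flow from reasoning through notification to the controller can be illustrated with a small sketch. It is written in Python for brevity, whereas the actual backend is built in Java, and it does not reproduce the real DAQExpert API: the classes `Snapshot` and `LogicModule`, the function `process`, the example condition and the step names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Snapshot:
    """Hypothetical monitoring snapshot; the fields are illustrative only."""
    stable_beams: bool
    l1_rate_hz: float


@dataclass
class LogicModule:
    """Expert knowledge: a condition on the snapshot plus recovery steps."""
    name: str
    condition: Callable[[Snapshot], bool]
    recovery_steps: List[str]


def process(snapshot: Snapshot,
            modules: List[LogicModule],
            notify: Callable[[str], None],
            send_command: Optional[Callable[[str], None]] = None) -> List[str]:
    """Reasoning -> notification -> optional automatic recovery.

    Returns the commands that were sent (empty in guidance-only mode).
    """
    sent: List[str] = []
    for module in modules:
        if not module.condition(snapshot):
            continue
        # Notification: tell the operators what was found and what to do.
        notify(f"{module.name}: " + " -> ".join(module.recovery_steps))
        # Controller: carry out the steps only if a command channel is given.
        if send_command is not None:
            for step in module.recovery_steps:
                send_command(step)
                sent.append(step)
    return sent


# Placeholder module; the condition and the step names are invented.
dataflow_stuck = LogicModule(
    name="Dataflow stuck",
    condition=lambda s: s.stable_beams and s.l1_rate_hz == 0,
    recovery_steps=["expert-defined step 1", "expert-defined step 2"],
)

messages: List[str] = []
commands: List[str] = []
snapshot = Snapshot(stable_beams=True, l1_rate_hz=0.0)

# Guidance only: operators are notified, nothing is sent.
process(snapshot, [dataflow_stuck], messages.append)
# Automatic recovery: the same steps go through the command channel.
process(snapshot, [dataflow_stuck], messages.append, commands.append)
```

In this sketch the same expert-defined steps serve both modes: without a command channel they reach the operator as guidance, and with one they are executed directly, which mirrors the role the controller takes over from the operators.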

First Recovery
First recoveries driven entirely by DAQExpert, without any operator involvement, were tested at the end of Run-2. The detailed report of the automatic procedure, which follows the expert recommendations, is shown in Figure 5. Although there is not enough data to determine the impact of this improvement, the possibility of eliminating human latency and errors has been demonstrated. The reaction time has been vastly reduced and now consists only of the monitoring delay and the necessary sanity checks. Based on operational data since 2016, we estimate that this feature would reduce the downtime of CMS by around 8 hours per year. This corresponds to 17% of the total DAQ-related downtime and 9% of the total CMS downtime.
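The quoted shares follow from the 2018 downtime figures given in the Scope section. The sketch below redoes this arithmetic. It also projects the efficiency under the purely illustrative assumption that the full 8 hours had been recovered in 2018; this projection is not a result reported in the text, and the variable names are illustrative.

```python
# 2018 figures from the text; the names are illustrative only.
STABLE_BEAMS_H = 2184
TOTAL_LOST_H = 90
DAQ_LOST_H = 46
ESTIMATED_GAIN_H = 8     # estimated yearly reduction from automatic recovery

share_of_daq_downtime = ESTIMATED_GAIN_H / DAQ_LOST_H
share_of_total_downtime = ESTIMATED_GAIN_H / TOTAL_LOST_H

# Illustrative projection: efficiency if the 8 hours had been recovered.
projected_efficiency = 1 - (TOTAL_LOST_H - ESTIMATED_GAIN_H) / STABLE_BEAMS_H

print(f"share of DAQ-related downtime: {share_of_daq_downtime:.0%}")
print(f"share of total CMS downtime: {share_of_total_downtime:.0%}")
print(f"projected efficiency: {projected_efficiency:.1%}")
```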

Summary
The expert tool is evolving together with the DAQ system, and it is improving as we learn more about the operational challenges. Its coverage and features are being improved, and we continue to observe a positive impact on data-taking efficiency, measured by the reduction in operator reaction time during interventions and in the demand for expert help. Having identified that the human factor was limiting further improvements, we have taken steps to bypass the operator in order to enable even quicker recoveries from data-taking issues. The first automatic recoveries were observed at the end of Run-2. They eliminate the overhead of human reaction latency and bring the total reaction time down to the delay from the monitoring system and the necessary sanity checks.
In 2018 the operator reaction time and the overhead of wrong decisions accumulated to 8 hours of downtime, corresponding to 9% of the total CMS downtime. By introducing automatic recoveries, we expect to significantly reduce this downtime in the coming years.
The relevant metrics show that DAQExpert operations were successful in Run-2. Based on operational data from the last several years, a significant improvement in the CMS data-taking efficiency is expected for Run-3.