Status report on the architecture and future upgrades of the CMS Electromagnetic Calorimeter Control And Safety Systems

The Electromagnetic Calorimeter (ECAL) is one of the particle detectors of the Compact Muon Solenoid (CMS) experiment at the CERN Large Hadron Collider (LHC). For more than ten years, the CMS ECAL Detector Control System (DCS) and the CMS ECAL Safety Systems (ESS) have supported the operation of the experiment, contributing to its high availability and safety. The evolution of both systems to fulfil new requirements and constraints, together with optimizations to improve usability and process automation, has led to several changes to their original design. This paper presents the current software/hardware architecture of both CMS ECAL control and safety systems and reviews the major changes applied to both systems during the past years. Furthermore, in view of the CMS Phase-II upgrade of this sub-detector, the corresponding plans for the control and safety systems are also discussed.


Introduction
The Electromagnetic Calorimeter is one of the particle detectors of the Compact Muon Solenoid [1] experiment at the CERN Large Hadron Collider [2]. In accordance with the CMS ECAL layout, a Detector Control System [3][4][5] was developed and divided into three main partitions: Barrel (EB), Endcaps (EE) and Preshower (ES). Successful physics data taking with the CMS ECAL relies strongly on the correct functioning and high availability of the DCS. By controlling the powering systems and monitoring the environment inside the detector, multiple subsystem parameters and the off-detector infrastructure, the control system ensures smooth operation within the CMS ECAL specifications, detects potentially harmful trends and performs protective and preventive actions to keep the detector in a safe state at all times. Furthermore, the integrity of the detector is ensured by the CMS ECAL Safety Systems [4][5], while the entire experiment equipment is protected by the CMS Detector Safety System (DSS) [1]. This paper reports on the current CMS ECAL DCS and CMS ECAL Safety Systems architectures, discusses the most recent changes to the control system and presents the plans for the evolution and future upgrades of both systems.

Architecture
As illustrated in Figure 1, a large number of power supply channels, environmental sensors, hardware devices and external parameters are handled by the CMS ECAL DCS. Most of this data is also used by the CMS ECAL safety systems to evaluate the detector safety condition and to trigger hardwired interlocks to relevant systems in case of abnormal and potentially harmful conditions. Multiple hardware interfaces and communication protocols are used in both systems, making them quite complex and heterogeneous. The sections below present their architectures with some considerations regarding changes from previous reports.

Detector Control System
The CMS ECAL DCS is undergoing constant evolution, either to follow changes in the CMS ECAL requirements or to stay in line with the CMS DCS [6] guidelines, the latest releases of the software framework components, current software platform versions and evolving hardware technologies. The control system software is partitioned and distributed across a set of three servers, to balance the load while combining components with similar functionalities and software interfaces. To meet the CMS DCS requirements for software redundancy, all interfaces to hardware were converted to Ethernet [5]. Figure 2 presents a simplified version of the current CMS ECAL DCS architecture, highlighting the data chains and the clear separation between the EB/EE and the ES software components.

Detector Safety System
The safety systems for the EB/EE and for the ES were implemented separately, following different approaches and guidelines. While the ES safety system PLC code is in line with other sub-detectors' safety systems, the EB/EE safety system has a specific design driven by the requirements of these partitions. The ES safety system details are not part of this publication; however, a simplified block diagram of this system is included in Figure 3 for reference.

Software
For the high-level process supervisory management, the Supervisory Control and Data Acquisition (SCADA) architecture was implemented with the Siemens WinCC Open Architecture (WinCC OA) [7] control system toolkit. To add specific functionalities, CERN Joint Controls Project (JCOP) [8] and CMS DCS frameworks are extensively used. The integration of the CMS ECAL DCS with the CMS DCS is realised through a Finite State Machine mechanism, allowing the CMS Technical Shifter (operator) to monitor and perform a pre-defined set of actions on the CMS ECAL partitions. A three-level role-based access control allows a clear definition of views and actions that can be accessed by experts, operators or non-authenticated users. The CMS ECAL DCS features a set of automated scripts running health check routines to ensure data and system integrity. Central databases are used to store/retrieve configuration parameters and to archive system and detector conditions data.
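The combination of a Finite State Machine with role-based access described above can be illustrated with a minimal Python sketch. It is not the WinCC OA or JCOP implementation: the state names, command names and the role required for each command are hypothetical and chosen only to show how a command is first checked against the role of the caller and then against the allowed transitions.

```python
from enum import Enum


class Role(Enum):
    """Three access levels, ordered by privilege (values are illustrative)."""
    GUEST = 0      # non-authenticated user: views only
    OPERATOR = 1   # e.g. the CMS Technical Shifter
    EXPERT = 2


# Hypothetical transition table: (current state, command) -> next state.
TRANSITIONS = {
    ("OFF", "SWITCH_ON"): "ON",
    ("ON", "SWITCH_OFF"): "OFF",
    ("ERROR", "RECOVER"): "OFF",
}

# Hypothetical minimum role needed to issue each command.
REQUIRED_ROLE = {
    "SWITCH_ON": Role.OPERATOR,
    "SWITCH_OFF": Role.OPERATOR,
    "RECOVER": Role.EXPERT,
}


class PartitionFsm:
    """Toy state machine for one partition (EB, EE or ES)."""

    def __init__(self, name, state="OFF"):
        self.name = name
        self.state = state

    def send(self, command, role):
        """Apply a command if the role is sufficient and the transition exists."""
        required = REQUIRED_ROLE.get(command)
        if required is None:
            raise ValueError(f"unknown command: {command}")
        if role.value < required.value:
            raise PermissionError(f"{role.name} may not issue {command}")
        key = (self.state, command)
        if key not in TRANSITIONS:
            raise ValueError(f"{command} is not allowed in state {self.state}")
        self.state = TRANSITIONS[key]
        return self.state
```

In this sketch an operator can switch a partition on or off, whereas recovering from the hypothetical ERROR state is reserved for experts; a non-authenticated user is rejected for every command.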

Protective and Safety Actions
Protective automatic actions performed by control system software applications are frequently and erroneously regarded as safety measures. The control system can always introduce additional protective mechanisms, but it can never replace autonomous actions performed by controllers or electronic circuits at the hardware level. The CMS ECAL features a 5-layer protection scheme (Figure 4) to ensure the integrity of the detector and the off-detector infrastructure. Human intervention (Figure 4, layer 1) is considered a parallel and permanent protection layer, as it can trigger actions on all other layers at any time. Preventive automatic actions via software (Figure 4, layer 2), based on pre-defined warning limits, are performed by the control system to move the detector to a safe state before more severe actions are taken by the safety systems. If the problem evolves to an alarm condition, or if the control system is unavailable, the CMS ECAL safety systems (Figure 4, layer 3) will perform actions at the hardware level to ensure that the detector is moved to a safe state. The health of the CMS ECAL safety systems is monitored by the CMS DSS through hardwired heartbeat signals. If these systems become unavailable, or if problems related to the experiment infrastructure occur, the CMS DSS (Figure 4, layer 4) will perform actions at the hardware level, not only to bring the detector to a safe state but also to protect the experiment's equipment. If all four protection layers fail, a human intervention to press the emergency button (Figure 4, layer 5) at the CMS Control Room is the ultimate action and will shut down the whole experiment. Thanks to the mechanisms implemented by the first four layers, the CMS Emergency Button has never had to be used because of CMS ECAL-related issues.
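The escalation between the automatic layers (2 to 4) can be summarised in a short Python sketch. It only models the decision logic described in this section; the function name, the limit values and the action strings are assumptions made for illustration, and the two human layers (1 and 5) are deliberately left out because they can act at any time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Warning and alarm thresholds for one monitored quantity."""
    warning: float
    alarm: float


def acting_layer(value, limits, dcs_available=True, ess_heartbeat_ok=True):
    """Return (layer, action) for the automatic layer expected to react.

    Layer numbers follow the protection scheme in the text:
    2 = DCS software, 3 = ECAL safety system, 4 = CMS DSS.
    """
    if not ess_heartbeat_ok:
        # The DSS watches the safety system heartbeat and takes over if it is lost.
        return (4, "DSS hardware action")
    if value >= limits.alarm or not dcs_available:
        # Alarm condition, or no control system to act preventively.
        return (3, "safety system hardwired interlock")
    if value >= limits.warning:
        # Preventive software action before the safety system has to react.
        return (2, "DCS protective software action")
    return (None, "no action")


# Hypothetical temperature limits in degrees Celsius, for illustration only.
example_limits = Limits(warning=20.0, alarm=25.0)
```

With these illustrative limits, a reading of 21.0 is handled by the control system software (layer 2), a reading of 26.0 by the safety system (layer 3), and any reading is escalated to the DSS (layer 4) once the safety system heartbeat is lost.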

Support and Maintenance
To contribute to the high availability of the CMS ECAL detector and consequently of the CMS experiment, on-call services are provided, and regular maintenance is performed. The sections below describe the most relevant tasks involved.

Support
The CMS ECAL DCS team provides two 24/7 on-call services: operator, who handles the daily operations and basic problems; and expert, who provides high level support in case of complex problems. To ensure an optimal response time in case abnormal conditions are detected, alerts to the CMS shifter and notifications via e-mail/SMS to relevant experts are automatically issued. The use of Agile Methodology for logging and tracking issues, provided and supported by the CERN IT Department through the JIRA Issues Tracking Service (ITS) [9], has proven to be an excellent approach to ease the control and safety system support.

Maintenance
Regular maintenance is performed to keep the CMS ECAL DCS in line with the CMS DCS evolution. This includes the migration to the latest supported versions of software development platforms, verification and adaptation of CMS ECAL DCS components to new framework releases, installation and commissioning of new hardware interfaces and any other modification required to guarantee a smooth integration into the experiment's environment. Periodic tests are performed to ensure that all hardwired interlocks can be successfully triggered and received by the corresponding system or device. To minimize debugging on the production system, three test systems are available for software and hardware validation prior to deployment at the experiment.
Modifications to the control and safety systems are carefully analysed and planned according to their impact and to the available time slots for deployment and validation. Small adjustments to the control system can normally be applied during the running periods. Changes to core software components, as well as small adjustments to the safety system, are planned for the short- and mid-term periods when the LHC is not delivering collisions, called Technical Stops (TS) and Year-End Technical Stops (YETS) respectively. Major changes to these systems are restricted to LHC Long Shutdown (LS) periods, which normally last at least one year. The upcoming second LHC shutdown (LS2) will take place from the beginning of 2019 until the end of 2020, and the third LHC Shutdown (LS3) is foreseen for 2024-2026. The next sections present the plans and major activities to be carried out during these periods.
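As a compact summary of the scheduling rules given in this section, the mapping between the type of change and the periods in which it may be deployed can be written as a lookup table. The sketch below is illustrative only: the keys ("control", "safety", "small", "core", "major") and the function name are labels invented here, not identifiers used by the CMS ECAL DCS team.

```python
# (system, scope of change) -> periods in which the change may be deployed.
# Labels are hypothetical shorthand for the categories described in the text.
DEPLOYMENT_WINDOWS = {
    ("control", "small"): ("running period",),
    ("control", "core"): ("TS", "YETS"),
    ("safety", "small"): ("TS", "YETS"),
    ("control", "major"): ("LS",),
    ("safety", "major"): ("LS",),
}


def allowed_windows(system, scope):
    """Return the deployment periods for a change, or raise if it is not covered."""
    try:
        return DEPLOYMENT_WINDOWS[(system, scope)]
    except KeyError:
        raise ValueError(f"no rule for a {scope} change to the {system} system") from None
```

For example, `allowed_windows("safety", "small")` returns the two technical stop types, while any major change maps to a Long Shutdown only.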

LHC LS2
The existing CMS ECAL DCS software development guidelines need to be revised, extended and applied to all components. The new guidelines will focus on the standardization of naming conventions, quality metrics and other methodologies to improve code styling, shared functionality, debugging, and user interfaces. To ease the support and long-term maintenance, a feasibility study will be carried out to assess the reduction from three CMS ECAL DCS servers to one, either by running the three existing WinCC OA projects on a single, more powerful server or by merging all components into a single project.
A set of CMS DCS installation tools is used to perform unattended software deployment in the CMS ECAL DCS test setups. These tools access the CMS DCS installation databases, build the latest version of the software using the official code repositories, monitor changes and synchronize the development copies with the software versions running at the experiment. In addition to the three servers mentioned above, virtual machines from the CMS computing infrastructure are also used to allow the parallel development of multiple features and to minimize maintenance efforts.
Performance issues in the communication between the DCS application and the low voltage power supplies were recently identified. A hardware intervention to re-organize the distribution of devices per communication bus is required. Such a modification requires mid- to long-term testing and validation and can only be performed during an LHC LS. In turn, the DCS software needs to be re-configured according to the new distribution of devices.
Following a request from the CMS ECAL Technical Coordination, a feasibility study will be carried out to assess an increase of granularity in the DCS protective automatic actions. In addition, other mechanisms and existing frameworks will be analysed and considered for the re-implementation and extension of these automatic actions in a less complex and more scalable way.
The CMS ECAL EB/EE safety system uses custom units to read out the environment temperature inside the detector. The communication with the PLC system is realised through a very basic custom protocol and therefore reduces the overall system's reliability. Two possible solutions are being considered for implementation during the LS2: the rewriting of the PLC code to handle the weaknesses of the current protocol; or the replacement of all custom readout units by industrial modules from Siemens. In both cases, to allow further consolidation between the CMS sub-detectors' PLC-based safety systems, the migration to a common framework at the DCS layer is the preferred solution.

LHC LS3
Launched at the end of 2017, the R&D for the major control and safety systems upgrade foreseen for the LHC LS3 will continue in parallel to the LS2 activities. The DCS software is interfaced to the hardware through many different devices and protocols (Figure 2 in Section 2.1 provides a graphical view), which considerably increases the system complexity and the maintenance effort. A project to standardize, concentrate and process all possible DCS hardware data in a PLC will be launched in the coming months.
Most of the Siemens PLC modules of the CMS ECAL safety system will reach their end-of-life (EOL) in the coming years, and the complete system must be replaced during LS3. A fully redundant UPS-based power distribution featuring LiFePO4 batteries (Figure 5, left) and a non-redundant Siemens 1500-series PLC system (Figure 5, right) were built for the initial evaluation of processing resources and peripherals. The redundant versions of the Siemens PLC 1500-series CPU and relevant modules, planned to be used in the final configuration, will be released by Siemens in the coming months. Once these modules are available, a fully redundant system will be built for the next development phase. The existing non-redundant system will then be moved to one of the CMS ECAL test areas at CERN, to support the upgrade activities carried out in a spare partition of the detector.
The existing readout electronics of the Precision Temperature Monitoring (PTM) system must be completely replaced to fulfil the new DCS specifications for the upgrade in LS3. The first prototype (Figure 6) was designed and produced at the beginning of 2018, featuring 32 channels, an Ethernet interface and an absolute precision of 0.01°C (in line with the original CMS ECAL requirements). The module is currently undergoing a long-term evaluation, during which several necessary improvements have already been identified. The current goal is to produce 64-channel readout units with a profile similar to that of the Siemens 1500-series modules, to be mounted on the same standard rails, with their data read out via Ethernet (TCP or Modbus-TCP) and possibly concentrated and processed either by the safety system PLC or by a dedicated DCS PLC.
Due to their limited readout range [5], the humidity sensors installed inside the CMS ECAL detector must be replaced in LS3. As a direct consequence, multi-channel readout electronics will need to be acquired or produced for the selected sensors. A setup to evaluate different types of humidity sensors was built (Figure 7) using the HygroCal100 humidity generator from Michell Instruments. Two sensors, Ohmic ABS-300 and LinPicco A420-G, are currently under evaluation and the first results are expected in 2019.
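To make the Modbus-TCP readout option for the planned PTM units more concrete, the Python sketch below builds a standard "read holding registers" request and decodes the matching response using only the standard library. The frame layout (MBAP header followed by the PDU, function code 0x03) follows the public Modbus specification. Everything specific to the PTM units is an assumption made for illustration: that each channel occupies one signed 16-bit register, that the raw value is expressed in hundredths of a degree Celsius, and that the 64 channels start at register address 0.

```python
import struct

READ_HOLDING_REGISTERS = 0x03  # standard Modbus function code
MAX_REGISTERS = 125            # protocol limit for a single read request


def build_read_request(transaction_id, unit_id, start_address, quantity):
    """Build a Modbus-TCP frame requesting `quantity` holding registers."""
    if not 1 <= quantity <= MAX_REGISTERS:
        raise ValueError("quantity must be between 1 and 125")
    # PDU: function code, first register address, number of registers.
    pdu = struct.pack(">BHH", READ_HOLDING_REGISTERS, start_address, quantity)
    # MBAP header: transaction id, protocol id (0 = Modbus),
    # number of bytes that follow (unit id + PDU), unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def decode_temperatures(response, counts_per_degree=100):
    """Decode a read-holding-registers response into temperatures in deg C.

    Assumes (hypothetically) one signed 16-bit register per channel,
    scaled in 1/100 deg C.
    """
    _tid, _pid, _length, _unit, function, byte_count = struct.unpack(
        ">HHHBBB", response[:9]
    )
    if function & 0x80:
        # Exception response: the byte after the function code is the error code.
        raise IOError(f"Modbus exception code {byte_count}")
    if function != READ_HOLDING_REGISTERS:
        raise ValueError(f"unexpected function code {function}")
    payload = response[9:9 + byte_count]
    if len(payload) != byte_count or byte_count % 2:
        raise ValueError("truncated or malformed response")
    raw = struct.unpack(f">{byte_count // 2}h", payload)
    return [value / counts_per_degree for value in raw]
```

A request for all 64 channels of one hypothetical unit would be `build_read_request(1, 1, 0, 64)`, which fits in a single transaction because the protocol allows up to 125 registers per read. Whether the final units expose their data this way is not stated in this report; a plain TCP protocol is listed as an alternative.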

Conclusion
The CMS ECAL control and safety systems are complex and heterogeneous. The large amount of monitored data and the multiple communication protocols for interactions with the hardware and other external systems introduce significant challenges for their development and maintenance. Therefore, the CMS ECAL DCS experts work constantly to improve reliability and robustness, to optimize software and operational procedures and to reduce maintenance efforts. Both systems will undergo significant upgrades during the next two LHC Long Shutdown periods. Some upgrade projects are ongoing, and several others will be launched in the coming months for the evaluation of hardware solutions and software improvements, with a focus on standardization, optimization and the preferred use of industrial devices for the next system generation. By providing preventive and protective automatic actions, backed by a 24/7 on-call service scheme, the CMS ECAL control and safety systems play a key role in supporting the daily activities of the detector, contributing significantly to the high level of availability of the CMS ECAL detector and, consequently, of the CMS experiment.