ATLAS Sim@P1 upgrades during long shutdown two

The Simulation at Point 1 (Sim@P1) project was built in 2013 to take advantage of the ATLAS Trigger and Data Acquisition High Level Trigger (HLT) farm. The HLT farm provides around 100,000 cores, which are critical to ATLAS during data taking. When ATLAS is not recording data, such as during the long shutdowns of the LHC, this large compute resource is used to generate and process simulation data for the experiment. At the beginning of the second long shutdown of the Large Hadron Collider, the HLT farm, including the Sim@P1 infrastructure, was upgraded. Previous papers emphasised the need for simple, reliable, and efficient tools and assessed various options for quickly switching between data acquisition operation and offline processing. In this contribution, we describe the new mechanisms put in place for the opportunistic exploitation of the HLT farm for offline processing and give results from the first months of operation.


Introduction
ATLAS [1] is a general-purpose experiment located at Point 1 (P1) of CERN's Large Hadron Collider (LHC). ATLAS employs a large computer farm, summarised in Table 1, to facilitate data acquisition and event selection. The HLT is a mission-critical part of the ATLAS experiment and is physically connected to the control network of the detector and to the "data" network, which allows connections to the CERN data centre through a switch at P1 [2]. The Sim@P1 project aims to opportunistically use the trigger and data acquisition high level trigger resources for offline computing. When working with Sim@P1 it is important to ensure secure isolation from the physical resources at P1, seamless integration into the ATLAS distributed computing system, and a reliable transition between the two functions of the resources. Throughout this text we refer to standard operation of the HLT as online mode and to operation as part of the ATLAS distributed computing system as offline mode.
A system satisfying these criteria was developed during the first long shutdown of the LHC [3]. Isolation is achieved by running virtual machines on the physical HLT hardware. The virtual machines were originally managed using the cloud framework OpenStack [4]. The virtual machines shared the "data" connection of the HLT hardware through a tagged virtual local area network (VLAN), which provided network isolation at the level of the Ethernet frame, managed by the switches. This VLAN allowed the virtual machines to connect to a controlled list of interfaces in the CERN general purpose network; the list specifies the interfaces needed to deliver and execute offline workloads. To minimise the impact of Sim@P1 on Trigger and Data Acquisition (TDAQ) operation, only simulation tasks from the central production system are submitted to run at P1. The original implementation of Sim@P1 ran successfully during the first long shutdown of the LHC facilities between 2013 and 2015. Once the experiment resumed data taking, the system was used opportunistically [5]: the HLT was switched from its TDAQ function to offline mode for intervals of a few days during technical stops and machine development periods. To allow this opportunistic usage, a set of scripts was developed to manage the transition of resources between online and offline function.
During the second long shutdown of the LHC facilities, starting in 2019, the HLT was upgraded. The changes to the HLT necessitated an upgrade of the Sim@P1 infrastructure (the Icehouse release of OpenStack used previously does not support CentOS 7, which the HLT started running in 2019). A previous publication offered multiple options for this upgrade [6]. In this paper, we describe how the system was modified for operation with the upgraded HLT hardware.

The Sim@P1 infrastructure
During the year end technical stop from December 2017 to March 2018, the computing hardware of the HLT was replaced with new nodes. Some old HLT hardware was retained at P1 and is permanently operating in offline mode. The new hardware has been used offline during the various technical and machine development stops throughout 2018. The current hardware configuration is summarised in Table 1. Groups of 32 or 40 servers are organised into racks in the data centre at P1.
When a rack is not needed for data taking, a shifter can set that rack to offline operation. This action triggers a change in the configuration database used by the TDAQ. The next time the configuration management system (Puppet, which runs once an hour) runs on any server, or trigger processing unit (TPU), in that rack, it changes the system configuration to reflect the change in the configuration database. An ephemeral disk providing 20 GB per core is created (its size is reduced if necessary to ensure that at least 20% of the hard drive remains free) and a virtual machine instance is started. This document will refer to such a running virtual machine as an instance.
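The sizing rule for the ephemeral disk can be sketched as follows. This is a minimal illustration of the two constraints stated above (20 GB per core, at least 20% of the drive kept free); the function name and interface are hypothetical, not part of the Sim@P1 tooling.

```python
def ephemeral_disk_gb(cores, drive_capacity_gb, used_gb=0.0):
    """Sketch of the ephemeral-disk sizing rule: request 20 GB per core,
    but shrink the disk so that at least 20% of the hard drive stays free.
    `used_gb` accounts for space already taken on the drive."""
    requested = 20.0 * cores
    # Space we may consume while still leaving 20% of the drive free.
    budget = 0.8 * drive_capacity_gb - used_gb
    return max(0.0, min(requested, budget))
```

For example, on a 24-core TPU with a 500 GB drive the nominal 480 GB request would be trimmed to 400 GB to preserve the 20% headroom.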
Instances are contextualised using amiconfig [7], originally a project by rPath, Inc. and now maintained by the CernVM team. The contextualisation is delivered using an ISO image attached to the instance by libvirt [8]. The ISO image is formatted as an OpenNebula data source. The contextualisation sets up the computing environment for the ATLAS offline workloads and makes the virtual machines advertise themselves to an HTCondor system running in the CERN general purpose network [9]. Instances are configured to use HTCondor's dynamic partitioning feature to map workloads to the resources. Process isolation is achieved using the control groups feature of the Linux kernel.
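HTCondor's dynamic partitioning corresponds to its partitionable-slot mechanism: one slot covers the whole machine and jobs carve off dynamic slots matching their own resource requests. The sketch below generates a minimal startd configuration of this kind; the values and the generator function are illustrative assumptions, not the actual P1 settings.

```python
def partitionable_slot_config(cores, memory_mb):
    """Emit a minimal HTCondor startd configuration with one partitionable
    slot spanning the whole machine; jobs then request cpus/memory and
    HTCondor dynamically splits off matching sub-slots."""
    return "\n".join([
        "NUM_SLOTS = 1",
        "NUM_SLOTS_TYPE_1 = 1",
        f"SLOT_TYPE_1 = cpus={cores}, memory={memory_mb}",
        "SLOT_TYPE_1_PARTITIONABLE = True",
    ])
```

With such a configuration a single TPU can serve one multi-core simulation job or several single-core jobs without any static slot layout.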
The HTCondor system for Sim@P1 was rebuilt with a single central manager and four schedulers. The virtual machines are managed by CERN's configuration management system. Work is submitted to the schedulers using the PanDA Harvester [10]. Sim@P1 presents a single unified production queue to the PanDA system. Harvester operates the queue in push mode, allowing workloads to request the resources they need by leveraging the dynamic partitioning of the workers. The HTCondor system now directly notifies the PanDA workload management system when resources are added to or removed from Sim@P1.
The new Sim@P1 network configuration still uses tags in the Ethernet frames to isolate traffic on a VLAN. Network security is improved by using a virtual router (with a separate IP table) that manages only the traffic in that VLAN. This avoids traffic being accidentally forwarded to an unintended host. Furthermore, online traffic is given a higher quality of service guarantee than traffic on the Sim@P1 VLAN: offline activities can potentially receive the full bandwidth available on the network, but will never impact traffic from other TDAQ activities, such as data taking.

Content delivery
CernVM is a good solution for Sim@P1 because the micro-images can be distributed to all the TPUs while requiring negligible disk space during HLT operation. Using CernVM means that we rely on the CernVM File System (CVMFS) to provide both the operating system content and the experiment software. In our CHEP 2018 contribution we incorrectly stated that the load on the Frontier squids during the switch to offline mode was low [6, 11]. This measurement was flawed: the instances were incorrectly contextualised to retrieve the CVMFS content from the Frontier squids operated by CERN IT for their batch system. Figure 1 shows that, after correcting the contextualisation, the Frontier squid at P1 was saturating its network bandwidth delivering the content required by the booting instances. To address the bottleneck posed by the single squid we added a second squid instance, then added CVMFS caches that persist throughout HLT operation, and finally created a hierarchy of squids on the instances in offline mode.
The addition of a second squid doubled the total bandwidth used to deliver the content required for CVMFS. As a result the transition to offline mode finished in three hours: approximately twice as fast as with the single Frontier squid.

Persistent CVMFS caches
Much of the CVMFS content remains unchanged between successive switches between offline and online mode: changes to the files in the CernVM system tend to be minimal, and an ATLAS software release, once downloaded, does not change. Keeping the CVMFS cache throughout online operation therefore reduces the amount of content the caches need to provide when transitioning to offline mode.
A 50 GB virtual disk image was created on each TPU to serve as a persistent CVMFS cache. The ephemeral disk may be made smaller to compensate for the size of the cache. Libvirt is configured to mount the cache as an additional drive. The contextualisation checks for a second drive: if the second disk exists and is formatted with the label "cache", it is mounted as the CVMFS cache; if the second disk exists but carries no labelled partition, the disk is formatted with the label "cache" and mounted. Figure 2 shows the number of CPUs and running processes, as well as the aggregate one-minute load on the TDAQ HLT farm as it transitions to offline mode and, using the same timestamps, the network load on the two Frontier squids delivering the CVMFS content to all the nodes. With a warm persistent cache (meaning most of the content that will be requested already resides in the cache) the network traffic on the squids is greatly reduced.
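The decision logic the contextualisation applies to the second drive can be sketched as a small pure function. The return values are hypothetical labels for the three outcomes described above; the actual formatting and mounting would be done with the usual system tools (e.g. mkfs and mount), which are omitted here.

```python
def cache_disk_action(disk_present, label):
    """Decide what to do with the optional second drive at boot:
      - labelled "cache"   -> mount it as the CVMFS cache
      - present, no label  -> format it with label "cache", then mount
      - absent             -> proceed without a persistent cache disk
    Sketch of the logic only; return values are illustrative."""
    if not disk_present:
        return "no-cache-disk"
    if label == "cache":
        return "mount"
    return "format-and-mount"
```

Keying on the partition label rather than a device path makes the check idempotent: a freshly formatted disk is recognised as a valid cache on every subsequent boot.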

Squid hierarchy
As a complementary approach to the persistent CVMFS caches, the contextualisation of the instances was adjusted to run a squid on two instances in each rack (the first and fifth TPU, to ensure the two squids are in separate chassis). The squids treat each other as siblings and the Frontier squid caches at P1 as parents. A web proxy auto-discovery service is set up to connect virtual machines to proxy caches on boot: instances serving a squid connect to the central Frontier squids at P1, while the others connect to the two squids in the same rack. The CernVM team added functionality to the CernVM micro-kernel to delay the boot process until at least one proxy cache is up.
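The per-node proxy choice described above can be sketched as follows. The rack positions (first and fifth TPU) come from the text; the host names, list interface, and function name are assumptions for illustration.

```python
def proxy_list(tpu_index, rack_tpus, central_squids):
    """Return the proxy caches a node should use on boot:
    the first and fifth TPU in a rack run a squid themselves and use
    the central Frontier squids at P1 as parents; every other node
    uses the two rack-local squids (which act as siblings)."""
    squid_hosts = [rack_tpus[0], rack_tpus[4]]
    if rack_tpus[tpu_index] in squid_hosts:
        return central_squids
    return squid_hosts
```

In practice this mapping would be served via web proxy auto-discovery rather than baked into each image, so the rack layout can change without re-contextualising the instances.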

Operational experience
The new configuration of Sim@P1 was operational within two days of the upgrade to the TDAQ HLT. This quick transition is a testament to the simplification of the Sim@P1 infrastructure under the new configuration. A feature of libvirt which negatively affects returning the resources to online operation was discovered; the following section describes the workaround that was implemented. The squid hierarchy provided a marginal improvement in the transition from online to offline operation while reducing the overall stability of the system: we started losing racks when the two squids in a rack ceased functioning.

Returning the resources
Returning the resources to online mode quickly and reliably is absolutely essential, all the more so when ATLAS is taking data. Should an instance running on a TPU be busy with I/O-intensive work (usually an active application using swap space), it may take a time of O(10 s) for libvirt to destroy the instance. Libvirt has a timeout of 15 s waiting on the destruction of a virtual machine; taking longer produces an error in returning that TPU to online mode. A helper script to manage the instance state was written. It attempts to destroy the instance up to three times, with a 15 s delay between attempts. With this modification, all resources were successfully returned to online mode without issues.
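The retry logic of the helper script can be sketched as below. The `destroy` callable stands in for the actual destruction call (e.g. a wrapper around `virsh destroy <domain>`) and is an assumed interface; the three attempts and 15 s delay are taken from the text.

```python
import time

def destroy_with_retry(destroy, attempts=3, delay_s=15):
    """Try to destroy a virtual machine instance up to `attempts` times,
    sleeping `delay_s` seconds between tries, since libvirt gives up
    after 15 s on a guest busy with I/O. `destroy` is a callable that
    raises RuntimeError on failure (interface assumed for this sketch)."""
    for attempt in range(1, attempts + 1):
        try:
            destroy()
            return True  # instance gone; TPU can return to online mode
        except RuntimeError:
            if attempt < attempts:
                time.sleep(delay_s)
    return False  # still failing after all attempts; flag for intervention
```

The delay between attempts gives the guest time to finish flushing swap, so a second or third `destroy` usually succeeds where the first timed out.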

Other workflows
Data taken by ATLAS is cached at P1 and transferred to the CERN data centre. Tests performed at the end of 2018 have shown that we must assume the network between P1 and the data centre is busy transferring data from the cache to storage in the CERN data centre during technical stops. Offline operation must not interfere with this process. Workflows requiring little data transfer across the network, such as simulation, are therefore a natural fit for Sim@P1.
Event generation is a frequent task that requires no input and produces very little output. Previous experience with event generation found that, depending on the software used and the physics signal generated, these tasks have a long tail in the required memory. We created dedicated high-memory and single-core queues to explore the use of Sim@P1 for event generation. At the end of 2019, the first event generation tasks were successfully executed on Sim@P1. Since event generation in ATLAS is a single-core workload, the four HTCondor schedulers were fully occupied managing event generation jobs, reducing the overall usage of Sim@P1.

Future improvements
The infrastructure supporting Sim@P1 could be further improved to reduce the work required to maintain and operate the resource.
The success in running event generation in late 2019 shows that this could be done in an automated setup. However, workloads that greatly exceed their memory allocation must be killed by HTCondor to ensure system stability. Furthermore, additional HTCondor schedulers must be commissioned to accommodate the increased number of jobs, and the number of single-core jobs allowed in the queue must be limited.
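A kill policy of the kind needed here can be sketched as a simple predicate; in HTCondor it would typically be expressed as a SYSTEM_PERIODIC_REMOVE expression comparing the MemoryUsage and RequestMemory job attributes. The 20% slack factor below is an assumption for illustration, not a Sim@P1 setting.

```python
def should_remove(memory_usage_mb, request_memory_mb, slack=1.2):
    """Sketch of a memory kill policy: remove a job once its observed
    memory usage exceeds its requested allocation by more than the
    slack factor (here an assumed 20%)."""
    return memory_usage_mb > slack * request_memory_mb
```

Some tolerance is needed because event generation memory usage is spiky; killing exactly at the requested value would remove many jobs that would otherwise finish.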
The hardware running the Frontier squids at P1 needs to be replaced. Supporting greater bandwidth for the CVMFS and Frontier content distribution will further improve the rate at which the farm can be switched from online to offline mode. With new hardware, the squid hierarchy described in Section 2.3 could be removed, improving the stability of Sim@P1.
As a longer-term project, OpenStack Heat or Kubernetes could be used to build an auto-scaling pool of HTCondor schedulers. Adding volatile storage to serve as a cache inside P1 may allow more data-hungry workflows to be executed without interfering with online operations.

Conclusions
The TDAQ HLT of the ATLAS experiment at the LHC was upgraded at the beginning of 2019. The infrastructure for the opportunistic usage of this large computing resource, Sim@P1, was swiftly and successfully reconfigured to function on the updated HLT hardware. The new configuration is simpler and more robust: it relies on low-level Linux tools and systems maintained by the TDAQ administrators. The addition of persistent CVMFS caches has made the system much more responsive. Future improvements promise to make Sim@P1 more robust, versatile, and easy to manage.