Integration of a heterogeneous compute resource in the ATLAS workflow

With the ever-growing amount of data collected by the experiments at the Large Hadron Collider (LHC), the need for computing resources that can handle the analysis of this data is also rapidly increasing. This increase will only be amplified after upgrading to the High-Luminosity LHC [1]. High-Performance Computing (HPC) and other cluster computing resources provided by universities can be useful supplements to the ATLAS collaboration's own WLCG resources for data analysis and for the production of simulated event samples. The shared HPC cluster "NEMO" at the University of Freiburg has been made available to local ATLAS users through the provisioning of virtual machines incorporating the ATLAS software environment, analogously to the bare metal system of the local ATLAS Tier2/Tier3 centre. In addition to the provisioning of the virtual environment, the on-demand, dynamic integration of these resources into the Tier3 scheduler is described. Resources are scheduled using an intermediate layer, which monitors requirements and requests the needed resources.


Introduction
The analysis of collision data collected at the LHC and the simulation of events are primarily done at the 2 Tier0, 13 Tier1 and 160 Tier2 sites of the WLCG [4]. Tier3 components at sites additionally provide resources for local groups. WLCG clusters are mostly set up for High Throughput Computing (HTC) in order to maximize the total computing throughput. High Performance Computing (HPC) clusters, as provided by universities and other institutions, sometimes even co-located at the same sites, may be used for HTC-like workflows to extend the capacities of the existing WLCG resources. However, no standard interfaces or abstraction layers are available to easily cross-link clusters with different HPC/HTC setups, different resource managers or different login schemes. Therefore, configuring clusters to distribute workloads to several schedulers and to share monitoring information is not a straightforward task.
In order to schedule resources on one cluster on demand, based on requirements arising on a different cluster, an intermediate layer, or resource scheduler, is put in place.
From its initial concept, the NEMO HPC cluster was designed to provide a full virtualization solution based on OpenStack [5] that enables users to spawn virtual machines (VMs) from a pre-configured image, so-called virtual research environments (VREs) [6]. These VMs are requested by sending a wrapper job to the NEMO batch system (MOAB [7]), which is queued in the same way as the jobs of other NEMO users. When the wrapper job starts, a virtual machine is spawned on the OpenStack instance. The lifetime of the VM is defined by the walltime of the MOAB job. After start-up and some initial checks, the VM is incorporated into the frontend scheduler of the ATLAS-BFG, the local Tier2/Tier3 cluster, as an additional resource. This resource is in turn used to run the jobs that triggered the start of the VM in the first place. All of these mechanisms are completely transparent to the users of the ATLAS-BFG.
In the following, we describe how these virtual resources are integrated into the ATLAS Tier2/Tier3 cluster. NEMO and the Tier2/Tier3 use different batch systems: on the user-facing (frontend) side, SLURM [8] is used as the scheduler, while the NEMO cluster (backend) runs a combination of MOAB and Torque. To schedule resources, we use ROCED [9], developed at the Karlsruhe Institute of Technology (KIT). ROCED is already used to integrate NEMO resources into the HTCondor [10] system at KIT. Due to its modular architecture, ROCED can also be used to connect the two systems in Freiburg. To this end, a new component monitoring the SLURM queue on the frontend has been developed.
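At its core, such a frontend component polls the SLURM queue and translates pending demand into VM requests. The following minimal sketch illustrates the monitoring step; the `squeue` format string is a standard SLURM option, but the function names and the simple one-job-per-slot accounting are illustrative assumptions, not the actual ROCED code.

```python
# Sketch of a SLURM-monitoring component in the spirit of the ROCED
# frontend adapter. Names and accounting are illustrative assumptions.
import subprocess
from collections import Counter

def pending_jobs_per_partition(squeue_output=None):
    """Count pending (PD) jobs per SLURM partition.

    If no pre-captured output is given, call squeue itself
    (%P = partition, %t = compact job state).
    """
    if squeue_output is None:
        squeue_output = subprocess.run(
            ["squeue", "--noheader", "--format=%P %t"],
            capture_output=True, text=True, check=True).stdout
    counts = Counter()
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == "PD":
            counts[fields[0]] += 1
    return dict(counts)

# Example with pre-captured output (partition, job state):
sample = "atlas_a PD\natlas_a PD\natlas_b R\natlas_b PD\n"
print(pending_jobs_per_partition(sample))  # {'atlas_a': 2, 'atlas_b': 1}
```

Passing captured output makes the parsing logic testable without a live SLURM installation.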
The performance of the system is measured using several different benchmarks. These benchmarks are also used to quantify the performance impact of changes in the configuration of the resource scheduler and of the virtual machines being spawned. They will also be part of a future continuous monitoring effort, in order to detect changes in the submitted workloads. This monitoring and tuning effort will ensure a robust, dynamic and efficient environment.

Challenges
The ATLAS research groups in Freiburg have very specific requirements on the operating system as well as on the installed software. This is to ensure reliable scientific results across all grid sites of the WLCG.
Virtualization has been found to be a technology that can simplify the provisioning of a specific environment on a range of heterogeneous and changing platforms, especially in the context of particle physics [11].
Being only one of multiple user groups on a shared HPC system, the choice of operating system in particular has to take into account the requirements of all user groups as well as of the party operating the cluster. A fully virtualized environment, independent of the choices made on the HPC cluster itself, gives the best possible scope to implement a system that looks and behaves in the same way as the non-virtualized Tier2/Tier3 cluster. This consistency between the two systems would also make it possible in the future to redirect ATLAS grid jobs submitted remotely to either NEMO or any other opportunistic resource, as long as the resource provides the infrastructure needed to run the VM images. The VM images made available to OpenStack on NEMO have to be created and updated easily in an automatic procedure and have to fulfil the following requirements:
• Scientific Linux 6 [12], the current OS on the Tier2/Tier3 cluster
• access to the ATLAS software via the CERN virtual file system CVMFS [13]
• the user environment of the Tier3
• access to both grid-aware datasets on the distributed storage system dCache [14] and the local NEMO parallel filesystem BeeGFS [15].
Since the VMs are completely self-contained, all features needed to monitor and benchmark the machines are independent of the two schedulers involved and can either be implemented on the VM itself or offloaded to the resource scheduler. In the future, this information will also be used for continuous monitoring of the robustness and performance of the system.

Generation of the virtual machines
The VM template is generated with Packer [16] and is based on a Scientific Linux 6 netinstall image. The customization and configuration of the template is done with Puppet [17], which is also used for the contextualisation of the non-virtualized worker nodes of the Tier2/Tier3 cluster. Changes in the configuration are therefore automatically picked up by both systems. The output of this procedure is a static image that can be uploaded to the OpenStack server and is then directly available.
In order to simplify software configurations across grid sites, the most commonly used software packages of the HEP workflow are distributed using CVMFS. The software distributed via CVMFS is managed centrally by the experiments, hosted on web servers and transferred to the worker nodes on demand.
The VMs do not contain a predefined CVMFS cache and do not use the hard disk as a persistent cache. Instead, a 256 MB RAM disk is filled on demand from the local Frontier Squid [18] proxies. This circumvents an on-disk CVMFS installation on the host and differs from other solutions that rely on access to software from CVMFS through an on-disk cache, such as CernVM [19].
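Such a RAM-backed cache could be configured along the following lines. The CVMFS parameter names are standard client options, but the mount point, sizes and proxy hostnames are placeholders, not the actual Freiburg configuration.

```shell
# /etc/fstab entry: a 256 MB tmpfs serving as the CVMFS cache
# (mount point and size are illustrative)
tmpfs  /var/cache/cvmfs  tmpfs  size=256m,mode=0700  0 0

# /etc/cvmfs/default.local: point CVMFS at the RAM disk and at the
# local Frontier Squid proxies (hostnames are placeholders)
CVMFS_CACHE_BASE=/var/cache/cvmfs
CVMFS_QUOTA_LIMIT=200   # soft quota in MB, kept below the tmpfs size
CVMFS_HTTP_PROXY="http://squid1.example.org:3128|http://squid2.example.org:3128"
```

Keeping the quota below the tmpfs size leaves headroom for CVMFS bookkeeping files within the RAM disk.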

Connection of front and backend batch systems
The biggest challenge for smooth operation is the interconnection of the two different batch systems, SLURM on the Tier2/Tier3 frontend and MOAB on the backend (NEMO), through a resource scheduler. The workflow is as follows: the resource scheduler monitors the frontend scheduler to which users send their workloads. The requirements are compared to a list of available configurations on the NEMO HPC cluster, and the resource scheduler decides on the number of virtual machines of each configuration needed to fulfil the requirements. The appropriate number of batch jobs is then sent to the backend scheduler, each spawning a virtual machine of the chosen type. The jobs used to start VMs in the OpenStack environment are regular user jobs on NEMO, which are started according to the availability of resources and the fair share of the NEMO user, which in turn reproduces the fair share of the project. After start-up, the VMs are integrated as additional resources into the SLURM scheduler and can then be used to process the user jobs queued in the frontend scheduler. Figure 1 shows the general mode of operation.
For now, SLURM on the Tier2/Tier3 cluster is set up with separate partitions that reflect the users' affiliation to one of the local ATLAS working groups. Each working group is in turn represented by a single dedicated user on NEMO, which is used to queue the jobs starting the VMs. By this mechanism, the fair shares of the different areas of research using NEMO are incorporated into the workflow. ROCED monitors these partitions: after a user submits a job, ROCED starts a wrapper job on the backend batch system, mapped to the dedicated user of the corresponding group. The wrapper job then triggers the start of a VM with a given configuration on OpenStack. As long as resources are requested and available on NEMO, additional virtual machines can be started.
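The decision step, mapping pending requirements to a number of VMs per machine configuration, can be sketched as follows. The flavour names and slot counts are invented for illustration and do not reflect the actual NEMO machine types.

```python
# Sketch of the resource scheduler's sizing decision: how many VMs of a
# given flavour are needed to cover the pending job slots. Flavour
# names and sizes are illustrative assumptions.
import math

FLAVOURS = {"vm-20core": 20, "vm-4core": 4}  # job slots per VM type

def vms_needed(pending_slots, flavour):
    """Smallest number of VMs of this flavour covering the pending slots."""
    slots_per_vm = FLAVOURS[flavour]
    return math.ceil(pending_slots / slots_per_vm)

# 45 pending single-core jobs on a partition served by 20-core VMs:
print(vms_needed(45, "vm-20core"))  # 3
```

In practice the request would additionally be capped by the fair share and the resources currently available on the backend.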
This mechanism dynamically extends the number of compute nodes and job slots available for physics analyses on the frontend system. Before a VM is integrated into SLURM, a diagnostic check verifies that all needed resources are available. After a successful check, the VM is set online in SLURM and jobs can be allocated to it.
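A minimal sketch of such a diagnostic gate is shown below, assuming a set of mount and credential checks before SLURM's `scontrol` resumes the node; the concrete checks performed in production may differ.

```python
# Sketch of the diagnostic check run before a VM is set online in
# SLURM. The individual checks are illustrative assumptions; the
# scontrol node-state update is standard SLURM administration.
import os
import subprocess

def vm_healthy(check_results):
    """The VM may go online only if every diagnostic check passed."""
    return bool(check_results) and all(check_results)

def gather_checks():
    # Illustrative checks: ATLAS software, work filesystem, grid setup
    return [
        os.path.ismount("/cvmfs/atlas.cern.ch"),
        os.path.ismount("/work"),
        os.path.exists("/etc/grid-security"),
    ]

def set_node_online(node_name):
    """Run the diagnosis and, on success, resume the node in SLURM."""
    if not vm_healthy(gather_checks()):
        return False
    subprocess.run(["scontrol", "update", f"NodeName={node_name}",
                    "State=RESUME"], check=True)
    return True
```

Separating the pass/fail decision from the check gathering keeps the gating logic testable independently of the VM environment.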

Benchmarks
To understand the performance losses introduced by moving to a fully virtualized environment, different benchmarks have been run. All benchmarks are carried out on the same hardware, and the results obtained in the virtualized research environment are compared to results obtained directly on hardware ("bare metal") on both the Tier2/Tier3 and the NEMO cluster, to also assess the impact of different operating systems on the benchmark results.
In addition to the legacy HEP-SPEC06 (HS06) benchmark [20], the evaluation of the performance of the compute resources makes use of three benchmarking programs available in the CERN benchmark suite [21]:
1. the Dirac Benchmark 2012 (DB12) [22]
2. the Whetstone benchmark [23]
3. the KitValidation benchmark (KV) [24].
The DB12 and Whetstone programs evaluate the performance of CPUs through floating-point arithmetic operations. One of the main differences between the two benchmarks resides in the variables used as input to the arithmetic operations: DB12 uses random numbers generated according to a Gaussian distribution, while Whetstone utilizes variables with predefined values. The KV benchmark runs the ATLAS software Athena [25] to simulate and reconstruct the interactions of muons in the ATLAS detector.
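To illustrate the principle behind DB12, the following simplified kernel times floating-point arithmetic on Gaussian random numbers and converts the elapsed time into a score. This is a sketch of the idea only, not the official DB12 implementation, and its normalization is arbitrary.

```python
# Simplified sketch of a DB12-style CPU benchmark kernel: time a fixed
# number of floating-point operations on Gaussian random numbers and
# turn the elapsed wall time into a score (higher = faster machine).
import random
import time

def db12_like_score(iterations=1_000_000, seed=42):
    rng = random.Random(seed)
    start = time.perf_counter()
    total = 0.0
    for _ in range(iterations):
        total += rng.gauss(0.0, 1.0) ** 2  # arithmetic on Gaussian randoms
    elapsed = time.perf_counter() - start
    # Arbitrary normalization: iterations per microsecond of wall time
    return iterations / (elapsed * 1e6), total

score, _ = db12_like_score()
print(f"score: {score:.2f}")
```

Running several such kernels in parallel, one per core, mirrors how the per-core scores as a function of core multiplicity are obtained.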
As our primary target is to measure the performance of CPUs in the context of High Energy Physics (HEP) applications, the KV benchmark constitutes a realistic payload, better suited to our goal than the DB12 and Whetstone software. The DB12 benchmark is measured in units of HS06 and can therefore be compared directly to the HEP-SPEC06 results. The performance has been evaluated on three different configurations: the Tier2/Tier3 and NEMO HPC clusters, both running on bare metal, and the virtual machines running on the NEMO HPC cluster; the exception is the KV benchmark, which currently cannot be run on NEMO bare metal nodes.
On the Tier2/Tier3, hyper-threading (HT) is activated, so the number of usable cores is twice the number of physical CPU cores. For the virtual machines, an arbitrary number of CPU cores can be requested. The operating system is Scientific Linux 6 in both cases. The NEMO bare metal has HT deactivated, due to the more general use case of the system, and uses CentOS 7 [26] as operating system. The scores of the HEP-SPEC06, DB12, Whetstone and KV benchmarks have been determined for these three configurations as a function of the number of cores actually used by the benchmarking processes. This number ranges, in steps of two, from 2 to 40 for the Tier2/Tier3 bare metal and for the VMs running on the NEMO cluster, for which HT is enabled, and from 2 to 20 for the NEMO bare metal, for which HT is not available. The benchmarks have been run 20 times for each core multiplicity, and the means and root-mean-squares (RMS) of the corresponding distributions have been extracted.
The scores per CPU and the total scores are presented in Figures 2 and 3, respectively, for the four benchmarks and the three configurations considered, except for the KV software, for which the NEMO bare metal results are not yet available: this benchmark retrieves the ATLAS software from the CVMFS file system, which is presently not available on the NEMO bare metal. A decrease of the Whetstone, DB12 and HEP-SPEC06 scores per CPU is observed in Figure 2 for increasing core multiplicities for the NEMO VM and NEMO bare metal configurations. This decrease is significantly less pronounced for the Tier2/Tier3 bare metal results: there, the scores per CPU remain constant until the maximum number of physical cores is reached, for all but the HS06 benchmark. The behaviour in the hyper-threaded region between 20 and 40 cores is similar for the two configurations. The KV results exhibit a similar behaviour for both the Tier2/Tier3 bare metal and the NEMO VMs, with a constant number of events produced per second per CPU below the maximum number of physical cores and a decrease of the performance beyond it. This behaviour is the closest to the pattern expected for an ideal CPU benchmark: a constant CPU performance per physical core and a decrease in the region where HT is active. The Whetstone score per CPU at a core multiplicity of 20, the maximum number of physical cores available, serves as an illustrative example of the benchmark behaviour on the three different configurations: an increase of the CPU performance of the order of 5% is observed when going from the Tier2/Tier3 bare metal to the NEMO VMs, and going from the NEMO VMs to the NEMO bare metal leads to a further increase of the order of 5%.
A continuously increasing total score is observed in Figure 3 for the Whetstone benchmark on all three configurations, while the DB12, KV and HEP-SPEC06 results are characterized by a flattening increase or a constant behaviour once the maximum number of physical cores has been reached. The Whetstone benchmark thus indicates a higher CPU performance in the HT region than the three other benchmarks: the KV and HEP-SPEC06 scores increase by 15 to 20% when going from the maximum number of physical cores to the upper edge of the HT region, while the Whetstone scores exhibit a larger increase, of the order of 60%. The Tier2/Tier3 bare metal and the VMs running on the NEMO cluster share the same configuration in terms of hardware, operating system and hyper-threading; a given benchmark should therefore exhibit a similar behaviour for both configurations. The KV benchmark, besides being the most realistic estimator of CPU performance in the context of High Energy Physics applications, is the only benchmark for which this expectation is met. The behaviour of the different benchmarks still needs to be studied in more detail, in order to fully understand the impact of the operating system, hyper-threading and virtualization on the CPU performance.

Summary
The HPC cluster NEMO has successfully been integrated into the workflow of local users in Freiburg running ATLAS data analysis jobs. This has been achieved in a transparent way, using full virtualization on NEMO and the resource scheduler ROCED for the integration of the virtual resources into the frontend scheduler. The system has been in production since fall 2017. First performance tests using different benchmarks show some degradation in the performance of the virtual machines compared to running on bare metal. The differences are within the ranges expected when using different operating systems and are significantly reduced when going from pure CPU benchmarks like Whetstone or DB12 to benchmarks more closely related to high energy physics analysis like HS06 or KV.
The continuous benchmarking effort will ensure a stable and efficient environment. Integrating the results of the benchmarks into the resource scheduling process can enable cluster administrators to test different configurations, leading to a more efficient usage of the provided resources.