Computing activities at the Spanish Tier-1 and Tier-2s for the ATLAS experiment towards the LHC Run3 and High-Luminosity periods

The ATLAS Spanish Tier-1 and Tier-2s have more than 15 years of experience in the deployment and development of LHC computing components and their successful operation. The sites are already actively participating in, and even coordinating, emerging R&D computing activities and developing the new computing models needed for the Run3 and High-Luminosity LHC periods. In this contribution, we present details on the integration of new components, such as High Performance Computing resources to execute ATLAS simulation workflows. We show the development of new techniques to improve efficiency in a cost-effective way, such as storage and CPU federations. We explain improvements in data organization, management and access through storage consolidations (“data-lakes”), the use of data caches, and improved experiment data catalogs such as the Event Index. The design and deployment of new analysis facilities using GPUs together with CPUs, and techniques like Machine Learning, are also presented. Tier-1 and Tier-2 sites are, and will be, contributing to significant R&D in computing, evaluating different models for improving the performance of computing and data storage capacity in the High-Luminosity LHC era.

* Corresponding author: santiago.gonzalez@ific.uv.es
© 2020 CERN for the benefit of the ATLAS Collaboration. CC-BY-4.0 license.
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).
EPJ Web of Conferences 245, 07027 (2020), CHEP 2019, https://doi.org/10.1051/epjconf/202024507027


ATLAS Computing Model (Mesh Model)
In the original computing model of the ATLAS experiment [1] at the LHC at CERN, each Tier-1 center had associated Tier-2 centers that were close by in terms of network connectivity. Together they formed a cloud, which in ATLAS means a regional setup of one Tier-1 and its Tier-2s in a certain geographical area. All data flowed to and from the Tier-2s via their Tier-1. Over time, however, more and more Tier-2s acquired excellent worldwide network connections and could exchange data directly among themselves. This led to the new ATLAS computing model (Mesh), which distinguishes Nucleus and Satellite sites [2].
Tier-2s with significant storage and excellent network connections are designated as "Nucleus" sites. They pass job production on to smaller Tier-2s (Satellites) in any cloud and exchange data with them directly. Currently, all the Spanish ATLAS Tier-2 sites are of the Nucleus type.

Spanish Tier-1 and Tier-2s inside ATLAS
The Spanish cloud (ES) [3] inside ATLAS has one Tier-1 at PIC in Barcelona, which provides 4% of the Tier-1 data processing for CERN's LHC detectors ATLAS, CMS and LHCb. It also comprises one federated Spanish Tier-2 (ES-ATLAS-T2), distributed over IFIC in Valencia, IFAE (co-located with PIC) in Barcelona, and UAM in Madrid with 50%, 25% and 25% of the resources respectively, and one Tier-2 in Lisbon, Portugal. ES-ATLAS-T2 represents 4% of the total Tier-2 resources. Together with the Tier-1, it is integrated in the Worldwide LHC Computing Grid (WLCG) project [4] and strictly follows the ATLAS computing model.
Table 1 shows the Spanish resources in October 2019, together with their availability and reliability. The ES cloud ranks at the top in both availability and reliability.

Development Activities for ATLAS Distributed Computing
The Spanish sites are actively participating in, and even coordinating, emerging R&D computing activities aimed at developing the new computing models needed for the LHC Run3 and HL-LHC periods. Spanish sites serving different experiments collaborate with each other, so similar or complementary work is also performed for other experiments such as CMS and LHCb; that work is included in this section. Spanish sites are participating in the DOMA-TPC test and in storage system performance studies for the implementation of the tape carousel [5]. We are developing a catalog of all events in all processing stages, called Event Index [6], which is needed to meet multiple use cases and search criteria. The ES cloud is contributing to the Event Service project, whose main goal is a more flexible and efficient usage of the available CPUs when running ATLAS simulation jobs. Finally, we are designing and deploying novel analysis facilities based on GPUs and adopting Machine Learning (ML) methods. We also contribute to monitoring tasks, such as monitoring Frontier-servers, site and cloud support tools, and live pages.

Data access studies and storage performance
Data access and storage performance studies are important to further optimize data organization and management. In the ES cloud these studies have been carried out by researchers of the CMS experiment working at Spanish sites that run dCache, using data gathered from the billingDB [5]. The goal of these studies is to answer several questions. How is the storage system utilized? Is it working optimally? Which data could usefully be cached, and what benefits would this bring? If the sites are close enough (10 ms in terms of latency), should we aim for a data federation and/or a consolidation of the storage deployed in the region? Detailed studies are being performed on how files are accessed over time, with the aim of detecting cases in which storage system usage might be optimized and identifying all cases in which data could be cached, leading to a better optimized storage deployment at the site. Simulations of caches, including how to properly handle the deletion of files from the cache based on real usage or on ML methods, are also being investigated prior to deploying caches in production in the region. The aim is to perform the same studies for the ATLAS experiment soon, in order to have a coherent and almost complete view from a regional perspective (only LHCb would be missing).
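As an illustration of the kind of cache simulation described above, the following minimal Python sketch replays a file-access trace through a fixed-size cache with least-recently-used (LRU) eviction and counts hits and misses. The trace format, file names, sizes and the choice of LRU as deletion policy are assumptions made for illustration only; they are not taken from the billingDB records or from the studies in [5].

```python
from collections import OrderedDict


def simulate_lru_cache(trace, capacity_bytes):
    """Replay (file_name, size_bytes) accesses through an LRU cache.

    Returns (hits, misses). Files larger than the cache are never stored.
    """
    cache = OrderedDict()  # file_name -> size_bytes, least recently used first
    used = 0
    hits = 0
    misses = 0
    for name, size in trace:
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
            continue
        misses += 1
        if size > capacity_bytes:
            continue  # the file cannot fit in the cache at all
        # Evict least recently used files until the new file fits.
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[name] = size
        used += size
    return hits, misses


# Hypothetical toy trace: (file name, size in bytes).
trace = [("a", 4), ("b", 4), ("a", 4), ("c", 4), ("b", 4), ("a", 4)]
hits, misses = simulate_lru_cache(trace, capacity_bytes=8)
print(f"hits={hits} misses={misses} hit rate={hits / (hits + misses):.2f}")
```

Replaying the same trace with different capacities or with another eviction rule gives a first estimate of how much a cache of a given size could help before it is deployed.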

Monitoring frontier-servers
Frontier-servers [7] optimize access to the so-called "conditions" database, which contains the operational parameters of all ATLAS subdetectors that are needed to run simulation or production jobs. A Squid server provides caching of data and a Tomcat servlet connects to the Oracle database when required.
The monitoring system collects information about the queries from log files. It operates in a steady and stable way, processing 12M queries daily on average across all the ATLAS sites during data-taking periods, and allows the visualization of meaningful information by means of a dashboard. Figure 1 shows one of its pages, with several relevant summary tables and histograms. The system incorporates an Alarms&Alerts subsystem that sends e-mail warnings when a site's performance deteriorates. CMS servers are also monitored.
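The alerting idea can be sketched in a few lines of Python. This is only a toy example under stated assumptions: the record layout (site name, response time in milliseconds), the site names and the threshold are hypothetical, and the actual log format and alert criteria of the monitoring system are not described here.

```python
from collections import defaultdict
from statistics import mean


def sites_to_alert(records, max_mean_ms):
    """Return the sites whose mean query response time exceeds a threshold.

    records: iterable of (site, response_time_ms) tuples, as they might be
    extracted from server log files (hypothetical layout).
    """
    per_site = defaultdict(list)
    for site, response_ms in records:
        per_site[site].append(response_ms)
    return sorted(
        site for site, times in per_site.items() if mean(times) > max_mean_ms
    )


# Hypothetical parsed log records.
records = [
    ("SITE_A", 120), ("SITE_A", 80),
    ("SITE_B", 900), ("SITE_B", 1100),
]
print(sites_to_alert(records, max_mean_ms=500))
```

In a production setting the returned list would feed the e-mail warning step rather than being printed.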

Event Index project
The Event Index project [6] aims to provide a catalog of all real and simulated events in all processing stages, which is needed to meet multiple use cases and search criteria. Billions of events (PetaBytes) have been indexed since 2015. Some of the use cases are event picking, duplicate event checking, overlap, trigger check, event skimming and trigger counter. The latest developments aim to optimize storage and operational resources in order to accommodate the growing amount of data produced by ATLAS, with a predicted 35 billion new real events per year in Run3 and 100 billion at the HL-LHC. At IFIC we have improved the data collection system [8], and we are currently developing new storage schemas using HBase/Apache Phoenix for the final data backend, with promising results.
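To make two of these use cases concrete, the sketch below implements event picking and duplicate event checking on a small in-memory catalog. It is a toy model and not the Event Index implementation: the class name, the key layout (run number, event number), the stage labels and the file identifiers are all assumptions chosen for illustration, and the real backend relies on HBase/Apache Phoenix rather than a Python dictionary.

```python
from collections import defaultdict


class ToyEventIndex:
    """Hypothetical in-memory catalog: (run, event) -> [(stage, file_id)]."""

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, run, event, stage, file_id):
        """Register one occurrence of an event in a file at a given stage."""
        self._index[(run, event)].append((stage, file_id))

    def pick(self, run, event, stage):
        """Event picking: files that hold the event at the requested stage."""
        entries = self._index.get((run, event), [])
        return [file_id for s, file_id in entries if s == stage]

    def duplicates(self, stage):
        """Duplicate checking: events stored more than once at one stage."""
        found = []
        for key, entries in self._index.items():
            if sum(1 for s, _ in entries if s == stage) > 1:
                found.append(key)
        return sorted(found)


index = ToyEventIndex()
index.add(1000, 1, "RAW", "file-A")
index.add(1000, 1, "AOD", "file-B")
index.add(1000, 2, "AOD", "file-B")
index.add(1000, 2, "AOD", "file-C")  # same event written twice at AOD stage
print(index.pick(1000, 1, "AOD"))
print(index.duplicates("AOD"))
```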

Physics case and machine learning
The Spanish sites are involved in several physics analyses, some of which need intensive computing power due to the use of ML methods. As an example, we use a sample of simulated events containing both ̅ resonances, which represent our signal, and Standard Model events, as would be observed by an experiment running at a proton-proton collider. Our aim is to eventually apply ML algorithms to ATLAS data in order to improve the identification of the signal events and their separation from background processes [9]. We have evaluated the performance of several standard ML algorithms, such as Neural Networks and Random Forest techniques provided by well-known Python libraries (Keras, Scikit-learn, …), investigating parameters such as the number of layers, the number of neurons and the activation function, as a first approach to understanding the benefit of these methods. Figure 2 shows the AUC (Area Under Curve) obtained for each resonance mass value from the percentages of signal and background identified as signal by the algorithm after its training. The samples used in Figure 2 come from a repository [10] with public ATLAS toy simulated events.
At IFIC-Valencia we are using the ARTEMISA (ARTificial Environment for Machine learning and Innovation in Scientific Advanced computing) [11] facility, whose core computing resources are based on GPUs (one server with four NVIDIA Tesla Volta GPUs and two servers with one GPU each). The worker nodes are composed of several Intel Xeon Platinum CPUs and Tesla Volta GPUs. ARTEMISA is being used to find the optimal configuration of each ML algorithm. For this purpose, the code implemented in Python has been tested and first results have been obtained with this facility in batch mode, showing improvements over the performance obtained when running on CPUs in the IFIC batch system.
The experience gained with this analysis use case on ARTEMISA is a first step towards integrating future acquisitions of the Tier-2 project into the ATLAS computing workflows.
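For readers less familiar with the AUC figure of merit used above, the following Python sketch computes it directly from classifier scores, using the fact that the area under the ROC curve equals the probability that a randomly chosen signal event receives a higher score than a randomly chosen background event (ties counted as one half). The score values are hypothetical and the function name is ours; in practice a library routine such as `roc_auc_score` from Scikit-learn would normally be used instead of this quadratic loop.

```python
def auc_from_scores(signal_scores, background_scores):
    """Area under the ROC curve from per-event classifier scores.

    Computed as the fraction of (signal, background) pairs in which the
    signal event is ranked above the background event; ties count 0.5.
    """
    if not signal_scores or not background_scores:
        raise ValueError("both samples must contain at least one event")
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))


# Hypothetical classifier outputs in [0, 1].
signal = [0.9, 0.8, 0.4]
background = [0.7, 0.3, 0.2]
print(f"AUC = {auc_from_scores(signal, background):.3f}")
```

An AUC of 0.5 corresponds to a classifier that cannot separate signal from background, and 1.0 to perfect separation.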

Challenges for Run3 and High-Luminosity LHC Periods
Higher luminosity means higher data flow, so a substantial increase in data volume is on the horizon, which poses a challenge for storage, network bandwidth and processing power [12]. As can be seen in Figure 3, HL-LHC estimates show a factor of three shortfall in CPU with respect to the flat budget model and, according to today's estimates, a factor of six shortfall in disk storage. In addition, the HL-LHC will require the network bandwidth to increase by a factor of 10.
Fig. 3. CPU (left) and disk storage (right) requirements predicted for the Run3 and High-Luminosity LHC periods [12]. Approaches to address the CPU and storage shortfalls are shown.
There are a few ways to face the CPU challenge: using HPCs, cloud computing and High Level Trigger farms; using fast instead of full simulation and speeding up the Monte Carlo generators by a factor of two; and running on GPUs, although significant time and effort are needed to adapt the software to the new architecture.
Approaches for solving the storage shortfall are: increased investment in computing; new file formats; reduction of data volumes; and more use of tape storage, although this last option slows down the workflow. These approaches will have a major impact on how we deploy the infrastructure: data-lakes with centers connected to sites with large storage capacity, use of caches, smaller data, more use of the network, etc.

Integration of High Performance Computing resources
Opportunistic resources turn out to be a meaningful way to face the future HL-LHC challenges in terms of CPU requirements. Spanish HPC centers have been exploited recently by deploying all the edge services needed to integrate their resources into the ATLAS workflow management system. Since it is not possible to install the edge services on HPC premises, we opted to install a dedicated ARC-CE that interacts with the HPC login and transfer nodes using ssh commands. The ATLAS software, including a partial CVMFS snapshot, is packed in a Singularity container image to overcome the network isolation of the HPC worker nodes and to reduce the software requirements. The integration of the Lusitania and MareNostrum HPC resources into the ATLAS production system started in April 2018 as a collaboration between IFIC and PIC. Since then, we have been granted periods in which we have been able to exploit several million hours of HPC resources, including 4.8 million hours in 2019 used by IFIC and PIC on the MareNostrum4 (MN4) facility. Figure 4 shows the large contribution of IFIC and IFAE-PIC to ATLAS Monte Carlo simulation production using opportunistic HPC resources. With the start of MN4, the number of slots in the ES cloud jumped from 7k to 12k. Sixty million events were simulated in 2019 and more than 90% of the jobs ended successfully. The Spanish community is working with the HTCondor developers to define a similar approach for CMS, which will allow the execution of CMS pilot jobs in MareNostrum [13,14].

Increasing bandwidth
Barcelona, Madrid and Valencia are connected by a dedicated network with a bandwidth of 20 Gbit/s, which is expected to be upgraded to 100 Gbit/s in 2020 thanks to REDIRIS, the Spanish academic network provider (RedIris-Nova). This will allow the Spanish cloud to meet the network challenges of the Run3 and HL-LHC periods.
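To give a feeling for what the upgrade means in practice, the short Python sketch below estimates how long a bulk transfer would take at the two link speeds quoted above. The 1 PB data volume and the assumption that the link is fully and exclusively available (efficiency of 1.0) are hypothetical simplifications; real transfers share the link and reach lower effective rates.

```python
def transfer_time_hours(volume_bytes, link_gbit_per_s, efficiency=1.0):
    """Idealized time in hours to move a data volume over a network link.

    efficiency is the assumed fraction of the nominal bandwidth that is
    actually usable (hypothetical parameter, 1.0 means the whole link).
    """
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and 0 < efficiency <= 1")
    bits = volume_bytes * 8
    bits_per_second = link_gbit_per_s * 1e9 * efficiency
    return bits / bits_per_second / 3600.0


one_petabyte = 1e15  # bytes, hypothetical dataset size
for gbit in (20, 100):
    hours = transfer_time_hours(one_petabyte, gbit)
    print(f"{gbit:>3} Gbit/s: {hours:.1f} hours")
```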

Toward a regional federation (Tier-1-Tier-2s)
With the goal of federating the resources at a national level, first studies have been carried out to understand how latency affects job workloads when reading data remotely. Overflow of CMS analysis jobs from PIC (Barcelona) to CIEMAT (Madrid) and vice versa was enabled by deploying a regional XRootD redirector with high availability. Since May 2019, pilot jobs have been flocked from PIC to CIEMAT and vice versa, using the HTCondor batch system deployed at both sites on dedicated machines (80 CPU cores available at each site). The sites are connected with low latency, measured to be around 10 ms. In these tests, input files are always read from within the region thanks to the regional XRootD redirector, which allows us to study the degradation of jobs that read data remotely within the region. The tests indicate that, within the region, PIC jobs could run either at PIC or at CIEMAT (reading files from PIC) without significant degradation, except for merge jobs, which should always run locally (as expected, since they are the most I/O-intensive activity). This working mode would increase PIC exports, stressing both the network and the storage systems. At higher latencies (40 ms), analysis jobs start to degrade, as seen in tests running ATLAS jobs in the Amazon cloud (AWS) [15]. We plan to perform similar tests with ATLAS at the ES sites.
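The qualitative behaviour described above can be reproduced with a very simple toy model, sketched here in Python. It assumes that every remote read is synchronous and costs one full network round trip that is not overlapped with computation; the CPU times and read counts are hypothetical and are not measurements from the PIC and CIEMAT tests. Under these assumptions the CPU efficiency of a job is its CPU time divided by CPU time plus accumulated network wait.

```python
def cpu_efficiency(cpu_seconds, n_remote_reads, rtt_ms):
    """Toy estimate of CPU efficiency for a job reading data remotely.

    Assumes each remote read blocks the job for one round-trip time and
    that reads are never overlapped with computation (a pessimistic,
    hypothetical model).
    """
    wait_seconds = n_remote_reads * rtt_ms / 1000.0
    return cpu_seconds / (cpu_seconds + wait_seconds)


# Hypothetical jobs: one hour of CPU each, different I/O intensity.
jobs = {"analysis-like": 10_000, "merge-like": 200_000}
for name, reads in jobs.items():
    for rtt in (10, 40):
        eff = cpu_efficiency(3600, reads, rtt)
        print(f"{name:>13} at {rtt} ms: efficiency {eff:.2f}")
```

In this model a job with few reads per CPU second is barely affected at 10 ms, while an I/O-heavy job loses efficiency quickly, which is consistent with keeping merge jobs local.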

Summary
The Spanish cloud contributes around 4% of the total resources deployed at Tier-1 sites and around 4% of those at Tier-2 sites. The Spanish Tier-2 has the so-called Nucleus status in ATLAS, which implies significant responsibilities and a larger work volume. We contribute not only to the deployment of CPU and storage resources for the ATLAS experiment; the teams in the Spanish cloud also carry out several research activities. The current ATLAS computing model will not be able to cope with the Run3 and HL-LHC challenges. The shortfalls in CPU resources, disk storage and bandwidth have to be assessed and faced, for instance in CPU by using opportunistic resources like HPC and in storage by studying new data formats and data processing. Concerning the bandwidth, the requirements seem to be satisfied at this time.