Overview of the ATLAS distributed computing system

The CERN ATLAS experiment successfully uses a worldwide computing infrastructure to support its physics programme during LHC Run 2. The Grid workflow system PanDA routinely manages 250 to 500 thousand concurrently running production and analysis jobs to process simulation and detector data. In total, more than 370 PB of data are distributed over more than 150 sites in the WLCG and handled by the ATLAS data management system Rucio. To prepare for the ever-growing LHC luminosity in future runs, new developments are underway to use opportunistic resources such as HPCs even more efficiently and to exploit new technologies. This paper reviews the outline and the performance of the ATLAS distributed computing system and gives an outlook on new workflow and data management ideas for the beginning of LHC Run 3. It is shown that the ATLAS workflow and data management systems are robust and performant and easily cope with the higher Run 2 LHC performance; there are presently no scaling issues, and each subsystem is able to sustain the large loads.


Introduction
During Run 2 the Large Hadron Collider (LHC) has continued to deliver large amounts of data to the experiments (see Fig. 1 left). The distributed computing system of the ATLAS experiment [1], outlined in Fig. 1 right, is built around two main components: the workflow management system PanDA [3] and the data management system Rucio [4]. It manages the computing resources to process these data at the Tier-0 at CERN, to reprocess them once per year at the Tier-1 and Tier-2 WLCG [5] Grid sites, and to run continuous Monte Carlo (MC) simulation and reconstruction. In addition, continuous distributed analysis by several hundred ATLAS users is executed. The resources used are the Tier-0 at CERN, Tier-1/2/3 Grid sites worldwide, and opportunistic resources at HPC (High Performance Computing) sites, Cloud computing providers and volunteer computing resources accessed via ATLAS@Home/BOINC [6]. The whole infrastructure is monitored by dedicated shift teams for production and distributed analysis. The ATLAS distributed computing system performs very well in handling and processing all these data.

Data processing in 2017 and 2018
ATLAS has been using considerably more CPU resources than officially pledged in 2017 and in 2018 so far (see Fig. 2 and Fig. 3). The pledged CPU resources sum up to roughly 250k concurrently running Grid job slots in 2017. Many ATLAS WLCG computing sites provide more CPU power than pledged. In addition, the ATLAS workflow management system very efficiently exploits the additional non-WLCG opportunistic resources mentioned in Sec. 1. Beyond-pledge and opportunistic resources are mainly used for MC event generation and simulation and have allowed the production of almost twice the number of fully simulated events than would have been possible with the pledged resources alone. Fig. 3 shows the normalised HEPSPEC06 wall clock time, which is calculated from the average HEPSPEC06 normalisation value of the CPUs at a site and the number of running jobs at that site as shown in Fig. 2. CPUs at HPC centres are usually weaker than those at Grid sites and have lower HEPSPEC06 normalisation values; this causes the peaks in the number of running jobs to disappear in the equivalent HEPSPEC06 distribution.
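The normalisation described above can be sketched as follows; this is an illustration only, with invented site names and numbers, not the actual accounting code:

```python
# Illustrative sketch (not ATLAS code): the normalised wall clock consumption
# is the per-site product of the number of running job slots, their wall clock
# time and the site's average HEPSPEC06 score per core, summed over sites.
# All names and numbers below are invented for the example.

def normalised_wallclock(sites):
    """Sum of slots * wall clock hours * average HEPSPEC06 per core."""
    return sum(s["slots"] * s["wallclock_h"] * s["hs06_per_core"] for s in sites)

sites = [
    {"name": "grid_site", "slots": 1000, "wallclock_h": 24, "hs06_per_core": 11.0},
    {"name": "hpc_site",  "slots": 3000, "wallclock_h": 24, "hs06_per_core": 7.0},
]
print(normalised_wallclock(sites))
```

In this toy example the HPC site provides three times the job slots of the Grid site but, with its weaker cores, contributes less than three times the normalised HS06 hours — which is why peaks in the raw running-job counts flatten out in the HEPSPEC06-normalised view.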

Data management
In the middle of 2018 the ATLAS data volume exceeded 370 PB of disk and tape storage. More than 20 million datasets with more than 1 billion files are stored world-wide across all ATLAS computing sites. About 180 PB of data are stored on disk, partially replicated with one additional copy at other sites to improve the availability and job throughput for popular data (see Fig. 4). The data transfers between the sites reach an aggregated traffic of 20 GB/s in the weekly average, with daily averages of up to 40 GB/s. On average 1.5-2 million files per day, with a volume of more than 1 PB per day, are transferred between different sites (see Fig. 5). A large fraction stems from delivering the input data to production jobs, especially for the so-called derivation framework. This workflow processes the reconstruction output AOD files into derived AOD files (DAOD), which are used in subsequent physics analyses. The data transfer system is currently limited by the maximum number of files that can be transferred simultaneously by the WLCG file transfer service FTS [7].
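A back-of-the-envelope check of these transfer figures (using the rounded numbers quoted above, purely for illustration):

```python
# Illustrative arithmetic on the quoted transfer figures.
files_per_day = 1.75e6           # midpoint of the quoted 1.5-2 million files/day
volume_per_day_pb = 1.0          # "more than 1 PB per day"

avg_file_gb = volume_per_day_pb * 1e6 / files_per_day    # PB -> GB per file
avg_rate_gbps = volume_per_day_pb * 1e6 / 86400          # sustained GB/s

print(round(avg_file_gb, 2), round(avg_rate_gbps, 1))
```

This gives a typical transferred file size of roughly 0.6 GB and a sustained rate of about 11.6 GB/s from file transfers alone, i.e. a large fraction of the 20 GB/s weekly-average aggregate traffic. It also shows why a per-file concurrency limit in FTS, rather than raw bandwidth, can become the bottleneck.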

Production activities
The major production activities in 2017 and 2018 so far have been the MC event generation, simulation and reconstruction campaigns MC15, for 2015 and 2016 data, and MC16, for 2017 and 2018 data (see Fig. 6 left). These activities took about 72% of the Grid resources in 2017 and will take a similar share in 2018. The group production of DAODs for data and MC took about 15% of the resources, while smaller fractions went into the reprocessing of the 2015-2017 data and into distributed analysis by individual ATLAS users. MC simulation, reconstruction, derivation production and data reprocessing use the multi-process workflow to make efficient use of the memory of the available Grid resources, with a maximum of around 2 GB per CPU core.
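A small sketch of why the multi-process workflow fits the 2 GB-per-core budget: after forking, the worker processes share the read-only part of the parent's memory via copy-on-write, so the shared portion is paid only once. All numbers here are invented for the illustration and are not measured ATLAS values:

```python
# Toy model of copy-on-write memory sharing in a forked multi-process job.
# shared_gb: memory common to all workers (code, detector geometry, ...)
# private_gb_per_worker: per-event working memory unique to each worker.
def per_core_memory(shared_gb, private_gb_per_worker, n_workers):
    return (shared_gb + n_workers * private_gb_per_worker) / n_workers

# One serial job: 1.5 GB shared + 1.0 GB private = 2.5 GB on its core,
# over the 2 GB/core budget. Eight forked workers sharing the same 1.5 GB
# bring the per-core footprint down to about 1.2 GB.
print(round(per_core_memory(1.5, 1.0, 1), 2),
      round(per_core_memory(1.5, 1.0, 8), 2))
```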
In 2017 almost 10 billion events were simulated using full simulation and about 1.6 billion events using fast simulation. Fast simulation uses a parametrised simulation of the ATLAS calorimeter and allows events to be simulated with less CPU time, at slightly reduced precision. Fig. 6 right shows the breakdown of produced full-simulation MC events per resource type. While Grid resources provide the vast majority of the capacity available to ATLAS, the contribution of Clouds and HPCs (14% and 9% shares respectively of full MC simulation event production) is significant. HPC resources are often limited to a few workflows, contributing mostly to MC simulation and, in the case of architectures other than x86_64, to event generation only.
The 'Data Processing' category had two major phases of activity: the reprocessing of 2016 data in spring 2017 and the reprocessing of 2017 data in winter 2017/18. A major campaign of derivation production using all 2015-2017 data and some MC was also carried out in winter 2017/18.

Tier-0
The Tier-0 centre at CERN is responsible for the safe-keeping of the RAW data and its distribution to the Tier-1s. The first-pass calibration and reconstruction is done within days after the data have been taken. The ATLAS Tier-0 delivered excellent throughput during 2017 and 2018 so far, benefiting from workflow and infrastructure improvements. The CPU cluster is configured without hardware hyper-threading, providing 4 GB of memory for each slot and making it possible to run the full Tier-0 processing chain in single-core mode. The workflow reads the RAW detector data and produces all output formats in a single step, keeping the intermediate ones in memory and thus reducing the I/O intensity. Fig. 7 left shows the number of running jobs in the LSF batch system dedicated to the ATLAS Tier-0. The LHC operation efficiency reached 74% during a week in October 2017, with an average between 50% and 60% during the data-taking period. In the periods of inactivity for first-pass reconstruction (in blue), the system is backfilled with Grid processing (in red), keeping the Tier-0 resources constantly in full operation. It can be seen that in the period from mid-October to the end of November, when the LHC was operating with a higher average number of proton-proton interactions, the Tier-0 was fully occupied by prompt reconstruction jobs and a backlog of up to 250k jobs accumulated (Fig. 7 right). This was due to a higher average trigger rate (up to 1.5 kHz) and the longer time per event needed to reconstruct events at higher pile-up values. During the LHC winter shutdown the Tier-0 was used mainly for Grid activities; in spring 2018 it switched back to Tier-0 prompt reconstruction workflows after the LHC restart.
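The backlog mechanism can be illustrated with a toy model: events arrive at the trigger rate while the farm drains them at a rate set by the number of slots and the reconstruction time per event. The slot count and time per event below are invented for the example, not the actual Tier-0 parameters:

```python
# Toy model (not the actual Tier-0 accounting): a backlog accumulates whenever
# the trigger rate exceeds the farm's drain rate of slots / seconds-per-event.
def backlog_after(hours, trigger_hz, slots, sec_per_event):
    drain_hz = slots / sec_per_event              # events completed per second
    growth_hz = max(0.0, trigger_hz - drain_hz)   # unprocessed events per second
    return growth_hz * hours * 3600

# E.g. a 1.5 kHz trigger against a hypothetical 25k single-core slots at
# 20 s/event: the farm drains only 1.25 kHz, so the backlog grows at 250 Hz.
print(int(backlog_after(12, 1500, 25000, 20)))
```

The same model shows why the backlog drained again once the trigger rate and pile-up (and hence the time per event) dropped: the drain rate then exceeds the arrival rate and the growth term vanishes.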

Distributed Analysis
ATLAS distributed analysis is performed on the Grid by several hundred individual ATLAS users. A large fraction of the events processed are read from DAODs (see Fig. 8). Some CPU time is devoted to processing or generating other formats, mainly for developing and testing new software algorithms or for systematic studies in physics analyses.

Monitoring and Analytics
The various components of the ATLAS distributed computing system are monitored in different ways using multiple tools; usually there is dedicated monitoring for each system. An example of such single-system monitoring is shown in Fig. 9 left, a screenshot of the PanDA workflow management system monitoring webpage. This monitor is based on the PanDA job information stored in an Oracle SQL database and provides extensive and detailed information about the system.
Since ATLAS distributed computing needs to account for, monitor and optimise the usage of all computing services and resources combined, an analytics platform has been developed and set up that collects information from all components and combines it where possible. Fig. 9 right shows a screenshot of a Kibana dashboard of the new ATLAS analytics platform [8]. The system is based on a NoSQL Elasticsearch cluster [9], custom-made data collectors, and data enrichment running on a Kubernetes cluster [10]. Many detailed studies of the ATLAS Grid system and workflow performance have been carried out, and several workflow improvements have been introduced using this combined information [11].
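The enrichment step can be sketched as a join of records from different components before indexing; the record and field names below are invented for the illustration and are not the platform's actual schema:

```python
# Hypothetical illustration of the kind of data enrichment performed before
# indexing into Elasticsearch: raw job records are combined with site
# metadata so that dashboards can aggregate over derived fields.
# All field names are invented for this example.
def enrich(jobs, site_info):
    for job in jobs:
        meta = site_info.get(job["site"], {})
        yield {**job,
               "tier": meta.get("tier"),
               "hs06_per_core": meta.get("hs06_per_core")}

jobs = [{"pandaid": 1, "site": "CERN-P1", "walltime_s": 3600}]
site_info = {"CERN-P1": {"tier": 0, "hs06_per_core": 10.0}}
print(list(enrich(jobs, site_info)))
```

Combining the sources at ingestion time, rather than at query time, is what allows a single dashboard to correlate quantities that no individual monitoring system holds on its own.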

Evolution of the systems
The ATLAS workflow and data management systems and the Grid infrastructure are still evolving, and some major new features are being developed or commissioned. To exploit all available resources, efficient job scheduling is necessary. The ability of the ATLAS workflow management system to use rather diverse resources will be improved with the development of Harvester [12], the Event Service and Event Streaming Service [13], and a new version of the PanDA pilot, as described further in this section.
ATLAS runs a diverse mixture of workflows and different processing campaigns at different times. Single- and multi-process workflows are submitted to many resources. So-called global fair-shares and unified queues have been introduced to limit the CPU slots per activity and to boost activities when needed. The currently partitioned production and analysis queues will be unified into a single queue per site to allow more efficient scheduling. The PanDA pilot code that executes the job payload on the site worker nodes has been redesigned for easier site support and is currently being commissioned.
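The fair-share idea can be sketched as a simple proportional allocator; this is a hypothetical toy, not the PanDA implementation, and the activity names and shares are invented:

```python
# Minimal sketch of a global fair-share allocation: each activity receives
# slots in proportion to its configured share, and slots left unused by
# activities with low demand are redistributed to the others.
def allocate(total_slots, shares, demand):
    """shares: activity -> relative share; demand: activity -> requested slots."""
    alloc = {a: 0 for a in shares}
    free = total_slots
    active = {a for a in shares if demand[a] > 0}
    while free > 0 and active:
        weight = sum(shares[a] for a in active)
        for a in sorted(active):
            quota = max(1, round(free * shares[a] / weight))
            give = min(demand[a] - alloc[a], quota, free)
            alloc[a] += give
            free -= give
        active = {a for a in shares if demand[a] > alloc[a]}
    return alloc

# Analysis only needs 10 slots, so production absorbs the unused share.
print(allocate(100,
               {"production": 0.5, "analysis": 0.5},
               {"production": 100, "analysis": 10}))
```

The redistribution loop is the key point: a fixed partition of the queues would leave the analysis surplus idle, which is exactly what the unified per-site queues are meant to avoid.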
Harvester is being developed to optimise the allocation of jobs to CPUs from different resources such as the Grid, HPCs and Clouds. Since every HPC has its own job scheduling mechanism and rather restricted user access policies, Harvester was initially developed for HPCs as a common transparent interface to these resources. It now also provides access to Grid and Cloud resources in the so-called PanDA pilot pull and push modes and is currently being commissioned at many ATLAS WLCG Grid sites.
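The "common transparent interface" idea can be sketched as a plugin pattern; the classes below are hypothetical and do not reflect the real Harvester plugin API:

```python
# Hypothetical sketch of a common submission interface over heterogeneous
# resources: each resource type hides its own mechanism behind submit().
class Submitter:
    def submit(self, job):
        raise NotImplementedError

class GridSubmitter(Submitter):
    def submit(self, job):
        return f"grid:{job}"    # e.g. hand the job to a pilot on a Grid site

class HPCSubmitter(Submitter):
    def submit(self, job):
        return f"batch:{job}"   # e.g. wrap the job for the local batch scheduler

def dispatch(job, resource, submitters):
    # the workflow management system never needs to know the resource details
    return submitters[resource].submit(job)

submitters = {"grid": GridSubmitter(), "hpc": HPCSubmitter()}
print(dispatch("sim_job_42", "hpc", submitters))
```

Per-resource specifics (batch scheduler syntax, access policies, pilot pull versus push mode) then live entirely inside the per-resource plugin, which is what makes adding a new HPC or Cloud provider a local change.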
The development of the Event Service (ES) workflow and the Event Streaming Service (ESS) allows fine-grained event-level processing instead of the classical file-level processing. The ES assigns fine-grained processing jobs to workers and streams out the data in quasi-real time, ensuring fully efficient utilisation of all resources, including volatile ones. The ESS is being developed to asynchronously deliver only the input data required for processing when it is needed, protecting the application payload from WAN latency without creating expensive long-term data file replicas. The data management system Rucio is very successfully handling the large amounts of ATLAS data; this has triggered interest in Rucio from other HEP experiments and communities. Several new features are being developed for Rucio: the Rucio input file mover has been added to the PanDA pilot code, awareness of site file caches is being added, object stores for input and output files are supported, and the automatic creation of archive files has