Operation of the ATLAS Distributed Computing

ATLAS is one of the four experiments collecting data from the proton-proton collisions at the Large Hadron Collider. The offline processing and storage of the data are handled by a custom heterogeneous distributed computing system. This paper summarizes some of the challenges and operations-driven solutions introduced in the system.


Introduction
The ATLAS distributed computing (ADC) [1], [2] was designed to meet the computational and storage needs of the ATLAS experiment. The experiment has produced over 370 PB of data to date and is adding new data at a rate of 1.5 PB per week (figure 1). The data are handled by a central data distribution management system based on the Rucio data management system [3]. They are processed in different workflows, both centrally and by end users, using the Athena framework. The traditional unit of processing work in ATLAS is the job. Every job is generated by the PanDA workflow management system [4] from requests and tasks defined by the task definition system, ProdSys2, and dispatched to one of the 180 computing resources, in over 40 countries, according to the specific job requirements and resource properties. The computing resources at ATLAS' disposal consist of WLCG grid sites [5], commercial and private clouds, High-Performance Computers (HPC) and volunteer computing resources [6]. ADC currently executes on average 1.1 M jobs per day (figure 2) on 347k computing slots on average (figure 3), peaking at 960k slots for several hours. The global average transfer throughput of the system is 12 GB/s (figure 4).

Operations
ADC operations are a shared effort between two dedicated teams, DDMOps (Distributed Data Management Operations) and DPA (Distributed Production and Analysis), and a set of dedicated shifters spread over the globe to ensure 24/7 coverage.

DDM Operations
The responsibilities of the DDMOps team include, but are not limited to:
• Optimal storage utilization - data rebalancing, replication and deletion according to the current data processing needs.
• File transfers - ensuring that the expected bandwidths are achieved.
• Data quality and consistency - tracing, deleting and recovering corrupted data; local site and central monitoring and reporting consistency checks.
• Site storages - central configuration and performance tests.
• Monitoring and reporting.
• Developer communication -reporting and following up on Rucio feature requests.
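The consistency checks mentioned above can be illustrated with a minimal sketch: compare a site's storage dump against the central catalogue to find "dark" (uncatalogued) files and lost (missing) files. The function name and the plain set-based comparison are illustrative assumptions, not the actual Rucio implementation, which operates at far larger scale.

```python
# Hypothetical sketch of a storage-consistency check: file lists are
# modelled as sets of logical file names. Not the real Rucio code.

def consistency_check(storage_dump: set[str], catalogue: set[str]):
    dark = storage_dump - catalogue  # on disk but not in the catalogue
    lost = catalogue - storage_dump  # in the catalogue but not on disk
    return dark, lost

# Toy example with three-file lists:
dump = {"data/f1", "data/f2", "data/f3"}
cat = {"data/f1", "data/f2", "data/f4"}
dark, lost = consistency_check(dump, cat)
print(sorted(dark))  # ['data/f3']
print(sorted(lost))  # ['data/f4']
```

In practice such checks run periodically per site, with dark data scheduled for deletion and lost files for recovery from other replicas.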

Distributed Production and Analysis
The main task of the DPA team is to ensure the efficient utilization of the computing resources in accordance with the current needs of ATLAS. Within the experiment there is a set of computational activities (workflows). Each workflow has appointed production managers who are responsible for following up on the requests from the physics groups and for fulfilling them within the deadlines. The priorities in sharing ATLAS computational resources between the different workflows are decided by the executive bodies of the collaboration and implemented at the computing level via the "global shares" [7]. DPA also takes care of system troubleshooting, as well as of the information flow between the different stakeholders.
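A minimal sketch of how a shares mechanism like the "global shares" could apportion a pool of computing slots among workflows, by normalizing per-workflow share values. The workflow names, share values and flat (single-level) normalization are illustrative assumptions; the actual ATLAS shares are hierarchical and enforced dynamically by the brokerage.

```python
# Illustrative apportionment of slots by normalized shares.
# Workflow names and share values are invented for the example.

def apportion(total_slots: int, shares: dict[str, float]) -> dict[str, int]:
    norm = sum(shares.values())
    return {wf: int(total_slots * s / norm) for wf, s in shares.items()}

slots = apportion(100_000, {"MC production": 60,
                            "Reprocessing": 20,
                            "User analysis": 20})
print(slots)  # {'MC production': 60000, 'Reprocessing': 20000, 'User analysis': 20000}
```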

System Optimizations
In order to achieve the highest possible resource utilization efficiency, system optimization is a constant, ongoing activity. The areas with the highest efficiency gains are failure-rate minimization, network, disk and tape storage utilization, and local batch system performance. Two of the latest optimizations are discussed below.

I/O Intensity Cut-off
The I/O intensity of a job is defined as its input data size normalized to its wall time. The I/O intensity of a task is set as the average I/O intensity measured from its first 10 finished jobs ("scouts"). Based on this value, and the location of the input data, the job brokerage decides to which site the next set of generated jobs should be dispatched. If the task's value is below the cut-off value, the jobs are dispatched to the selected sites; otherwise, the jobs are run where the input data are located. The threshold is chosen to optimize resource usage while reducing the amount of file transfer required. In the future, this binary cut-off will be replaced by a weight in the standard PanDA brokerage. The default I/O intensity cut-off is 200 kB/s. An increase of the cut-off value increases the number of running slots of high-I/O-intensity jobs, affecting the overall number of running slots (figure 5) at the expense of increased network utilization (figure 6). Adjusting the I/O cut-offs properly is a key factor for system performance: setting the value too high results in transfer delays, which in turn lead to low slot occupancy.
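The cut-off logic described above can be sketched as follows. The intensity of a task is the average input size per wall-time second of its scout jobs; tasks above the cut-off stay with their input data. The function names, units and site lists are illustrative assumptions, not the actual PanDA brokerage code.

```python
# Hedged sketch of the I/O-intensity cut-off decision.

CUTOFF_KBPS = 200.0  # default cut-off, 200 kB/s (from the text)

def io_intensity_kbps(scouts: list[tuple[float, float]]) -> float:
    """scouts: (input_size_kB, wall_time_s) of the first finished jobs."""
    return sum(size / wall for size, wall in scouts) / len(scouts)

def broker(scouts, candidate_sites, data_sites):
    if io_intensity_kbps(scouts) < CUTOFF_KBPS:
        return candidate_sites  # low I/O: any suitable site, data transferred
    return data_sites           # high I/O: run where the input data are

# Toy example: 4e6 kB over 1e4 s gives 400 kB/s, above the cut-off,
# so the jobs stay with their data.
sites = broker([(4_000_000.0, 10_000.0)], ["SITE_A", "SITE_B"], ["SITE_DATA"])
print(sites)  # ['SITE_DATA']
```

Replacing this binary decision with a weight, as planned, would simply fold the intensity into the site-ranking score instead of branching on it.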

PanDA Retrial Module
The PanDA Retrial Module is a mechanism that can significantly reduce the CPU time lost to jobs which fail due to known software problems. It allows DPA to override the retry mechanism based on the error exit message. The default action for a failed job is to retry it a number of times, where the number is either set manually by the task submitter at definition time or hard-coded per workflow in the system. When it is known that a certain error will appear in a certain fraction of a task's jobs, the retrial module allows an action to be set for exactly these cases, based on the error message. The actions currently supported are:
• Do not retry.
• Increase the memory requirement on the next retry.
• Limit the number of retries to a new value.
• Increase the CPU time on the next retry.
A Web UI (figure 7) simplifies the introduction of new retry rules significantly.
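The rule matching behind such a module can be sketched as follows: rules match on a substring of the error message and override the default retry behaviour. The rule structure, action names and matching-by-substring are assumptions for illustration, not the actual PanDA implementation.

```python
# Illustrative error-based retry policy in the spirit of the
# PanDA Retrial Module. Rules and action names are invented.

RULES = [
    {"match": "known segfault in release X", "action": "no_retry"},
    {"match": "exceeded memory",             "action": "increase_memory"},
]

def retry_action(error_message: str, default: str = "retry") -> str:
    """Return the override action for a failure, or the default."""
    for rule in RULES:
        if rule["match"] in error_message:
            return rule["action"]
    return default

print(retry_action("job hit exceeded memory limit"))  # increase_memory
print(retry_action("transient network glitch"))       # retry
```

The Web UI then amounts to an editor for the rule table, which is why it speeds up the introduction of new rules so much.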

Conclusion
The distributed computing system of the ATLAS experiment at the LHC is necessarily complex due to the diverse workload requirements and heterogeneous resources available. New workloads must be accommodated at ever higher scales, whilst the underlying resources change beyond our control. All this demands strong operational and shift teams, and a continuous optimization procedure. We have described two recent examples and expect further such incremental improvements to be needed as we approach Run 3 and beyond.