ATLAS utilisation of the Czech national HPC center

Abstract. The Czech national HPC center IT4Innovations, located in Ostrava, provides two HPC systems, Anselm and Salomon. The Salomon HPC has been among the hundred most powerful supercomputers in the world since its commissioning in 2015. Both clusters were tested for usage by the ATLAS experiment for running simulation jobs. Several thousand core hours were allocated to the project for tests, but the main aim is to use free resources waiting for large parallel jobs of other users. Multiple strategies for ATLAS job execution were tested on the Salomon and Anselm HPCs. The solution described herein builds on the ATLAS experience with other HPC sites. An ARC Compute Element (ARC-CE) installed at the grid site in Prague is used for job submission to Salomon. The ATLAS production system submits jobs to the ARC-CE via the ARC Control Tower (aCT). The ARC-CE processes the job requirements from aCT and creates a script for the batch system, which is then executed via ssh. Sshfs is used to share scripts and input files between the site and the HPC cluster. The software used to run jobs is rsynced from the site's CVMFS installation to the HPC's scratch space every day to ensure the availability of recent software. Using this setup, the opportunistic capacity of the Salomon HPC was exploited.


Introduction
The Czech National Supercomputer Center IT4Innovations in Ostrava operates the Salomon HPC system, the most powerful computer in the Czech Republic; it is listed in the Top500, ranked 87th in the world as of November 2017 [1]. Salomon was built in 2015 and provides 2 PFLOPS of peak performance. It consists of 1008 computational nodes, each with 24 Intel Xeon E5 cores and 128 GB of RAM, interconnected with InfiniBand (56 Gb/s). In addition, 432 nodes contain 61-core Intel Xeon Phi accelerators. In total, the computational nodes provide 24192 cores, with an additional 52704 accelerator cores. Salomon nodes run CentOS 6 with PBSPro as the batch system. The nodes have only very limited external connectivity, through an HTTP/HTTPS proxy.
The ATLAS experiment at CERN [2] uses the Salomon cluster in an opportunistic fashion via the Czech Tier2 site (praguelcg2) [3]. Only non-accelerated nodes are available for opportunistic usage and, unlike on other opportunistic HPC resources, there is no job pre-emption. The HPC batch scheduler selects jobs of projects without a dedicated computing-time allocation in order to increase supercomputer utilization efficiency by filling empty batch slots with opportunistic jobs. Jobs waiting for resources for too long are killed in the batch system and reassigned by the ATLAS production system to another site to ensure timely completion of tasks.
The first successful ATLAS job submitted to Salomon via the ARC-CE finished in December 2017, and since then Salomon has continuously and significantly contributed to the Czech Tier2 ATLAS production output.

ARC-CE
Traditional pilot-based grid jobs do not always deal well with the restrictive HPC environment. ATLAS developed the ARC Control Tower (aCT) [4] to be more flexible in utilizing computing resources with specific configurations. ATLAS jobs coming from aCT are transformed by the ARC-CE [5] computing element into a script suitable for the HPC's PBSPro batch system. The environment required by jobs is set via RunTime Environments (RTEs) (see section 3). The computing element also downloads input files, which are stored in the session directory together with the start-up scripts, and subsequently uploads output files. The standard ARC-CE interface for the Torque batch system was modified to accommodate the differences with the PBSPro batch system on Salomon and the job management that must be done via ssh through an HPC login node. Another ARC-CE modification was necessary to provide the HTTP proxy configuration for access to conditions data:

    echo "export 'http_proxy=http://praguelcg2SquidIP:443'" >> $LRMS_JOB_SCRIPT
    echo "export 'all_proxy=http://praguelcg2SquidIP:443'" >> $LRMS_JOB_SCRIPT

The Torque commands were changed to go through ssh as follows:

    qsub     -> /usr/bin/ssh -i path_to_id_rsa user@loginnode.salomon.it4i.cz "/opt/pbs/default/bin/qsub -A DD-13-2 $@"
    qstat    -> /usr/bin/ssh -i path_to_id_rsa user@loginnode.salomon.it4i.cz qstat $@
    qdel     -> /usr/bin/ssh -i path_to_id_rsa user@loginnode.salomon.it4i.cz qdel $@
    qmgr     -> /usr/bin/ssh -i path_to_id_rsa user@loginnode.salomon.it4i.cz qmgr \"$@\"
    pbsnodes -> /usr/bin/ssh -i path_to_id_rsa user@loginnode.salomon.it4i.cz pbsnodes $@
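For illustration, the per-command mappings above can be packaged as small wrapper scripts on the ARC-CE host. The following is a minimal sketch of the qsub wrapper's logic, written as a shell function; the ssh binary is made overridable so the assembled command can be dry-run, and the user name and key path are placeholders as in the listing above.

```shell
# Sketch of the qsub -> ssh forwarding (hypothetical helper; on the real
# ARC-CE this would be a wrapper script on $PATH replacing the local qsub).
SSH_BIN="${SSH_BIN:-/usr/bin/ssh}"        # overridable, e.g. for dry runs
LOGIN="user@loginnode.salomon.it4i.cz"    # placeholder account

qsub() {
    # Forward the submission to PBSPro on the Salomon login node,
    # charging the opportunistic project DD-13-2.
    "$SSH_BIN" -i path_to_id_rsa "$LOGIN" \
        "/opt/pbs/default/bin/qsub -A DD-13-2 $@"
}
```

The same pattern applies to qstat, qdel, qmgr, and pbsnodes, which differ only in the remote command they forward.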

RunTime Environment
RunTime Environments (RTEs) are a way to accommodate environments specific to a given site, experiment, or job. An RTE itself is a shell environment initialization script. Since it is needed for a job to run, its location must be directly accessible from the worker node (WN) (see section 4). The following RTEs are required by ATLAS jobs:
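Independently of the specific RTEs, the stage-based logic such a script implements can be sketched as follows. This is a hypothetical example, not a production ATLAS RTE; the variable names and paths are assumptions. ARC sources the RTE script with a stage argument, where stage 1 is executed on the worker node just before the job payload starts.

```shell
# Hypothetical RTE logic, written as a function for illustration;
# a real RTE is a shell script sourced by ARC with the stage in $1.
atlas_rte() {
    case "$1" in
        1)  # stage 1: on the worker node, before the payload runs.
            # Point jobs at the software mirror on shared scratch
            # (illustrative path) and at the site squid proxy.
            export ATLAS_SW_BASE=/scratch/atlas/cvmfs      # assumed name
            export http_proxy=http://praguelcg2SquidIP:443
            ;;
    esac
}
```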

Software
Standard ATLAS jobs use executables, libraries, and data from CVMFS [6], from the repositories atlas.cern.ch (the sw/local, sw/software/21.0, and ATLASLocalRootBase directories) and sft.cern.ch (the binutils directory). CVMFS is available on the ARC-CE, but not on the worker nodes; therefore the required files are synchronized onto the shared scratch area of the HPC with rsync once a day. Approximately 0.5 TB in about 10 million files is made available through the HPC's Lustre shared filesystem.

Cache
An analysis of input file reuse showed that many input files are used more than once. The analysis was performed on jobs that finished successfully between the end of December 2017 and the beginning of March 2018. During this period there were 15k successful jobs (and thus 15k input files), but these represent only 3k unique files (Figure 1). Some of these files were used up to 16 times. This was sufficient motivation to deploy an ARC-CE cache located on the shared file system. To introduce caching, a modification of the sshfs sharing was necessary, because the ARC-CE needs to be able to make both soft and hard links within the cache area. A newer OpenSSH was downloaded and compiled on the HPC, and the parameter -o sftp_server=path_to_sftp_server was added to the sshfs command.
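The resulting mount then looks roughly as follows. This is a sketch with illustrative mount points, keeping the placeholder paths from the text; only the -o sftp_server option, pointing at the locally compiled sftp-server, differs from a plain sshfs mount.

```shell
# Hypothetical sshfs mount on the ARC-CE host. The locally compiled
# OpenSSH sftp-server provides the hard-link support the cache needs;
# path_to_sftp_server and path_to_id_rsa are placeholders as in the text.
sshfs -o sftp_server=path_to_sftp_server \
      -o IdentityFile=path_to_id_rsa \
      user@loginnode.salomon.it4i.cz:/scratch/atlas/arc \
      /mnt/salomon
```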
The reduction of the job start-up time is significant. Before the cache was deployed, it took O(1) h (left plot in Figure 2) to create jobs for all 100 slots. With the cache, it takes O(1) min (right plot in Figure 2). Each bin in the plots in Figure 2 is 5 min wide. The cache size is O(1) TB, because only files which have not been used in the last 60 days are deleted.
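The 60-day expiry can be sketched as a sweep over file access times. This is a hypothetical stand-in for illustration (the ARC-CE also ships its own cache-cleaning mechanism, which can serve the same purpose), and the cache path is illustrative.

```shell
# Hypothetical cache sweep: delete cache files not accessed for 60 days.
clean_cache() {
    cache="$1"   # e.g. the ARC cache directory on the shared scratch
    # -atime +60: last access more than 60 days ago
    find "$cache" -type f -atime +60 -delete
}
```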

Job statistics (January-May 2018)
The HPC runs only one type of ATLAS workload, simulation jobs. These have the best ratio of CPU intensity to I/O, i.e. they have relatively small input and output compared with the CPU time they consume. Some failures were caused by a software installation issue, but as the wallclock plot (Figure 7) shows, they did not waste a significant amount of resources. The plot of the CPU consumption of all good jobs in the different active praguelcg2 queues (Figure 8) shows that the HPC provides a significant fraction of the resources (about 15%).