The LZ UK Data Centre

LUX-ZEPLIN (LZ) is a Dark Matter experiment based at the Sanford Underground Research Facility in South Dakota, USA. It is currently under construction and aims to start data taking in 2020. Its computing model stipulates two independent data centres, one in the USA and one in the UK. Both data centres will hold a complete copy of the experiment’s data and are expected to handle all aspects of data processing and user analysis. Here we discuss the set-up of the UK data centre within the context of the existing UK Grid infrastructure and show that a mature distributed computing system such as the Grid can be extended to serve as a central data centre for a reasonably large non-LHC experiment.


LZ
The LUX-ZEPLIN (LZ) experiment [1] grew out of the now completed LUX [2] and ZEPLIN [3] experiments. The resulting LZ collaboration consists of about 250 scientists across 37 institutions, mainly in the US and the UK, but also in Portugal, Russia, and Korea. Its primary objective is the search for high-mass weakly interacting massive particles (WIMPs). The LZ detector is based around a 7-tonne liquid xenon volume and is currently being constructed ∼1.4 km underground in a disused mineshaft at the Sanford Underground Research Facility (SURF), USA. Data taking is scheduled to start in 2020.

GridPP and the Imperial College Grid site
GridPP [4] is a collaboration of 19 UK institutes providing computing resources (CPU and storage) to particle physics and related experiments in the UK. The main consumers of its resources are the LHC experiments. GridPP currently comprises about 70k compute cores and 50 PB of storage across all sites; this is expected to grow in the future. The Imperial College Grid site provides around 6k cores and 5 PB of storage. It serves as a Tier-2 for the CMS and LHCb experiments, while also supporting more than a dozen other experiments.

The LZ Computing Model
An overview of the LZ computing model is shown in figure 1. At its core are two data centres, one in the USA (USDC) and one in the UK (UKDC). Both data centres are expected to hold a complete copy of the LZ data. The raw data are transferred from the detector at SURF to the USDC. From there the data are simultaneously archived to tape and replicated to the UKDC. Monte Carlo data are generated at both data centres and transferred between them. During stable periods of running the two data centres plan to concentrate on different aspects of reconstruction and analysis. However, both data centres are expected to be able to handle all aspects of data processing, including user analysis, in order to provide a fail-over should one of the data centres be temporarily unavailable. An overview of the predicted CPU and storage requirements for both data centres can be found in table 1. It should be noted that these numbers indicate CPU usage averaged over a year; peak demand can vary substantially from this. See figures 4a and 4b for the actual CPU usage during the two LZ Mock Data Challenges.

Figure 1: The LZ computing model. The UKDC is circled in red. EVT refers to raw data and RQ ("reduced quantities") to reconstructed data.

Set-up of the UK Data Centre
Unlike the USDC, where both storage and processing facilities are located at the National Energy Research Scientific Computing Center (NERSC), the UKDC uses a distributed approach on GridPP-provided computing resources. The data themselves will be hosted at the Tier-2 at Imperial College London, while their processing and analysis will take place at various UK Tier-2s and possibly other collaborating European institutes. Currently Imperial College provides 650 TB for LZ in a dedicated pool group on its dCache [5] based Storage Element, ahead of the requirements shown in table 1. Data access is provided via GridFTP [6] or XRootD [7]. Data transfers between NERSC and Imperial College will use SPADE [8], a Java-based data transfer tool originally developed for the IceCube experiment. While this is being set up, data have been transferred between the two data centres using the FTS [9] server based at Imperial College.
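As an illustration of the two access protocols, the sketch below shows how per-protocol transfer URLs for a file on the Imperial College Storage Element could be constructed. The endpoint host names, ports, and the file path are hypothetical placeholders, not the actual site configuration:

```python
# Hypothetical endpoint addresses, for illustration only.
GRIDFTP_ENDPOINT = "gsiftp://gfe02.grid.hep.ph.ic.ac.uk:2811"
XROOTD_ENDPOINT = "root://xrootd.grid.hep.ph.ic.ac.uk:1094"

def access_urls(path: str) -> dict:
    """Map a file path on the Storage Element onto per-protocol transfer URLs."""
    path = path if path.startswith("/") else "/" + path
    return {
        "gridftp": f"{GRIDFTP_ENDPOINT}{path}",
        "xrootd": f"{XROOTD_ENDPOINT}/{path.lstrip('/')}",
    }

# Illustrative file path in the LZ pool (not a real dataset path).
urls = access_urls("/pnfs/hep.ph.ic.ac.uk/data/lz/raw/evt_000001.root")
```

In practice such URLs are resolved by the Grid middleware (e.g. by DIRAC or FTS) rather than constructed by hand; the sketch only shows how the same file is exposed under both protocols.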

Considerations for non-LHC VOs using distributed computing
At the time the UKDC was commissioned, the LZ collaboration had no experience with distributed computing, apart from a few individual members who had encountered it within the context of other, often LHC, experiments. There was considerable concern within the collaboration as to whether LZ's workflows could be adapted to run on distributed resources. This led to the development of a custom web submission interface, with considerable effort going into providing intuitive access to the UKDC and hiding much of the complexity of the underlying infrastructure from the end users. The main interfaces of the so-called Job Submission Interface (JSI) are the data manager page, giving an overview of the current state of data processing, shown in figure 2, and the user request interface, shown in figure 3. To create a new workflow, a user chooses a specific software type (e.g. "DER") and the interface will then only display valid version numbers of this software for the user to pick. While any LZ user is allowed to submit requests via the JSI, only requests approved by the designated production managers are processed. The JSI is currently used for the majority of LZ production and reconstruction workflows at the UKDC.
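The version filtering and approval gating described above can be sketched as follows. The software catalogue contents, version numbers, and state names are illustrative assumptions, not the actual JSI configuration:

```python
# Hypothetical catalogue of software types and their valid versions.
# In the real JSI this information is derived from the LZ software
# repositories; the entries here are made up for illustration.
VALID_VERSIONS = {
    "DER": ["7.4.6", "8.0.0"],
    "LZap": ["4.5.1"],
}

def versions_for(software: str) -> list:
    """Return only the version numbers the interface should display."""
    return VALID_VERSIONS.get(software, [])

def may_submit(request: dict) -> bool:
    """A request proceeds only if its version is valid for the chosen
    software type and a production manager has approved it."""
    return (request["version"] in versions_for(request["software"])
            and request["state"] == "approved")

req = {"software": "DER", "version": "8.0.0", "state": "approved"}
```

The same two checks appear at both ends of the workflow: the frontend never displays an invalid version, and the backend refuses any request that has not been approved.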

Technical implementation
LZ uses the GridPP DIRAC instance [10] to access GridPP resources. The LZ Job Submission Interface (JSI) is built on top of DIRAC's Python API. The production interface consists of a backend and a frontend component. The backend is solely responsible for communication with the Imperial College DIRAC instance, while the user interacts with the frontend. The frontend and backend run independently of each other, communicating via an SQL (MariaDB [11]) database. In the current set-up they run on separate hosts, providing isolation between the two.

The user interface is served over an encrypted HTTPS connection and is based on CherryPy [12], a lightweight Python web framework. User authentication is handled by an Apache reverse proxy, which requires the user's browser to present a valid X.509 grid certificate. Authorisation is performed by comparing the user's DN to the list of valid LZ users obtained from the LZ VOMS server. All user actions are passed from the AJAX client to the web-server daemon via a RESTful interface, whereupon the database is updated.

The backend consists of two daemons. The first polls the database, handling all requests that are in a non-final state. Approved requests are submitted to the DIRAC daemon, and the returned status updates are written back to the database. The second daemon is a DIRAC client wrapped in a lightweight XML-RPC server. This isolates the DIRAC client, which requires a different environment from the rest of the backend code. The RPC server runs on the loopback address, making it available only on the backend machine. The LZ software and code repositories, hosted on CVMFS [13] and GitLab [14] respectively, are used as input to ensure that production requests conform to LZ software constraints, e.g. valid software versions and approved configurations.
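A minimal sketch of the first daemon's polling cycle is given below, using an in-memory SQLite database and a stubbed submission function in place of the JSI's MariaDB database and DIRAC submission. The table layout and state names are illustrative assumptions:

```python
import sqlite3

# Assumed terminal states; requests in these states are never revisited.
FINAL_STATES = ("done", "failed")

def poll_once(db, submit):
    """One polling cycle: pick up every request still in a non-final state,
    hand approved ones to the submission layer, and write the returned
    status back to the database."""
    cur = db.execute(
        "SELECT id, state FROM requests WHERE state NOT IN (?, ?)",
        FINAL_STATES)
    for req_id, state in cur.fetchall():
        if state == "approved":
            new_state = submit(req_id)  # e.g. returns "submitted"
            db.execute("UPDATE requests SET state = ? WHERE id = ?",
                       (new_state, req_id))
    db.commit()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE requests (id INTEGER PRIMARY KEY, state TEXT)")
db.executemany("INSERT INTO requests (state) VALUES (?)",
               [("approved",), ("pending",), ("done",)])

# Stubbed submission: the real daemon calls into the DIRAC daemon here.
poll_once(db, submit=lambda req_id: "submitted")
```

In the real backend this cycle runs continuously, so a request moves through its states across successive polls rather than in a single pass.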
While not developed for general-purpose use, the code is open source and available on GitHub [15].
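The isolation pattern of the second daemon can be sketched with Python's standard xmlrpc modules: the client with the incompatible environment is wrapped in a server bound to the loopback address, so it is reachable only from the backend host itself. The exposed method below is a hypothetical stand-in for the real DIRAC client calls:

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Bind to the loopback address only; port 0 lets the OS pick a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
port = server.server_address[1]

def job_status(job_id):
    """Hypothetical stand-in for querying DIRAC for a job's status."""
    return {"job_id": job_id, "status": "Running"}

server.register_function(job_status)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The rest of the backend talks to the wrapped client over loopback only.
proxy = ServerProxy(f"http://127.0.0.1:{port}")
status = proxy.job_status(42)
server.shutdown()
```

Because the two sides only share the RPC interface, the wrapped client can run with its own Python environment while the rest of the backend keeps a separate one.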

Mock Data Challenges
The LZ production set-up has been thoroughly tested during two successful Mock Data Challenges (MDCs). During the first MDC, in 2017, LZ generated and analysed the equivalent of one month of data (∼130 TB in 700k files). After a number of technical issues identified during this challenge had been resolved, the 2018 MDC processed six months' worth of data. For the MDC in 2017 all of the data were generated at the UKDC, whereas in 2018 half were generated at the UKDC and the rest at the USDC. Figures 4a and 4b show the number of jobs run on GridPP resources during the two MDCs. A third challenge, simulating and processing the equivalent of a further six months of data, is planned for 2019.
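As a quick sanity check on the MDC 2017 volume quoted above, the average file size implied by ∼130 TB spread over ∼700k files can be worked out directly (assuming decimal units, 1 TB = 10^12 bytes):

```python
# Back-of-envelope check on the MDC 2017 data volume.
total_bytes = 130e12   # ~130 TB, decimal units assumed
n_files = 700e3        # ~700k files
avg_mb = total_bytes / n_files / 1e6
print(f"average file size: {avg_mb:.0f} MB")  # ~186 MB
```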