DUNE Production processing and workflow management software evaluation

Abstract. The Deep Underground Neutrino Experiment (DUNE) will be the world's foremost neutrino detector when it begins taking data in the mid-2020s. Two prototype detectors, collectively known as ProtoDUNE, have begun taking data at CERN and have accumulated over 3 PB of raw and reconstructed data since September 2018. Particle interactions within liquid argon time projection chambers are challenging to reconstruct, and the collaboration has set up a dedicated Production Processing group to perform centralized reconstruction of the large ProtoDUNE datasets as well as to generate large-scale Monte Carlo simulation. Part of the production infrastructure includes workflow management software and monitoring tools that are necessary to efficiently submit and monitor the large and diverse set of jobs needed to meet the experiment's goals. We give a brief overview of DUNE and ProtoDUNE, describe the various types of jobs within the Production Processing group's purview, and discuss the software and workflow management strategies currently in place to meet existing demand. We conclude with a description of our requirements for a workflow management software solution and our planned evaluation process.


Introduction to DUNE
The Deep Underground Neutrino Experiment is an international mega-science project that aims to probe the nature of the universe through a variety of methods, such as studying neutrino oscillations, searching for proton decay, and collecting the large neutrino flux from a core-collapse supernova in our galaxy. The collaboration currently has over 1200 members in over 30 countries. The full experiment will consist of a Near Detector at Fermilab, itself containing three sub-detectors, and a Far Detector at the Sanford Underground Research Facility (SURF) in South Dakota, consisting of four 10-kt liquid argon time projection chambers utilizing various readout technologies.
In addition to the main detectors, two prototype detectors are currently operational at CERN: ProtoDUNE Single-Phase (SP) and ProtoDUNE Dual-Phase (DP). These detectors utilize the same technology and design (at approximately 5% scale) intended for the Far Detector modules and serve as technology demonstrators for the larger project. The ProtoDUNE-SP detector took approximately six weeks of data with beam on a staggered schedule from September to November 2018. Another beam run is planned for 2022 with both the SP and DP detectors. In the meantime, both detectors continue to take cosmic ray data and run a variety of calibration and DAQ system studies. Experience with ProtoDUNE is vital to building expertise in maintaining liquid-argon TPCs over the long periods of time that DUNE will require.

The DUNE Production Group
The DUNE Collaboration already has an active and expanding computing effort; Ref. [1] provides an overview of the current state of DUNE computing. Within the overall Computing Consortium, the Production Group is responsible for running central reconstruction on ProtoDUNE raw data and large-scale Monte Carlo (MC) generation for all detectors, and it works with the data management group to curate datasets. Additionally, test jobs submitted by Production Group members are typically the first jobs sent to new sites as they join the DUNE computing effort. Group members also work very closely with the data management group to test future storage solutions, and with service providers to test current and future workflow management systems. The group size has varied from two to six people over the past two years.
The main consumers of computing resources have been the ProtoDUNE reconstruction campaigns. The initial reconstruction campaign for the ProtoDUNE-SP data ran from September to December 2018, and a reprocessing pass occurred between late August and early October 2019. Table 1 summarizes the CPU and storage utilization of the most recent ProtoDUNE-SP reprocessing campaign.
Table 1. Statistics on the most recent ProtoDUNE-SP reprocessing campaign, August-October 2019. "Physics" data is defined as data taken with beam, as opposed to cosmic ray or DAQ test data. "Good" runs are those runs taken with beam when the DAQ system and detector were fully stable.

  Raw data (SP+DP)                                 3329 TiB
  Raw "physics" beam data (SP)                      786 TiB
  Reco output size (SP "good" physics runs only)    169 TiB
  Wall hours (beam data + new MC)                   2.08 M

Current production workflow submission infrastructure
Production jobs use Fermilab's Production Operations Management System (POMS) [2,3] for submission. POMS provides both GUI and command-line options for job launches (both immediate and scheduled) and for recovery project setup, and it provides integrated monitoring with Fermilab's Landscape project. Job submission is typically via Fermilab's Jobsub tool [4]. Jobsub in turn interfaces with the GlideinWMS workflow management system [5] for resource provisioning and matchmaking to slots at Fermilab, on dedicated DUNE resources at other sites, or on opportunistic cycles on the Open Science Grid (OSG) [6]. Figure 1 illustrates the entire chain, including interaction with storage elements within the job. The architecture can also provision High-Performance Computing (HPC) resources, such as those at NERSC, with the HEPCloud [7,8] infrastructure. The submission mechanism is unchanged whether the jobs are HTC or HPC; this seamless transition is key to efficiently utilizing available resources and also saves the job submitter significant effort by not requiring customized submission infrastructure for different resource types. Choosing a glidein-based system at this stage of the experiment had several advantages. DUNE was able to quickly leverage the existing FIFE [2] toolset, including POMS and Jobsub, negating the need for significant effort from the experiment in getting jobs running quickly.

Figure 1. Overview of the current DUNE Production workflow setup, also used by the ProtoDUNE detectors for data reconstruction and simulation. Production group members interact with POMS to submit jobs, which uses the Jobsub tool to submit jobs to an HTCondor schedd. GlideinWMS provisions worker node resources, and jobs match to the available worker node slots. DUNE jobs interact with storage elements both at Fermilab and at other sites, both for input copy (or streaming for most production workflows) and output copyback.
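At the end of this chain, jobs are handed to an HTCondor schedd and matched to glidein slots. For readers unfamiliar with HTCondor, a minimal, generic submit description is sketched below; the script name, resource requests, and job count are hypothetical and do not reflect DUNE's actual production configuration, which is generated by Jobsub rather than written by hand:

```
# Generic HTCondor submit description (illustrative sketch only;
# file names and resource requests are hypothetical)
executable     = run_reco.sh
arguments      = $(Process)
output         = reco_$(Process).out
error          = reco_$(Process).err
log            = reco.log
request_cpus   = 1
request_memory = 2000 MB
request_disk   = 10 GB
queue 100
```

In the production system, users never write such files directly: POMS launches campaigns through Jobsub, which constructs the equivalent submission and adds the experiment-specific attributes GlideinWMS uses for matchmaking.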
Since the system is in use by other neutrino experiments at Fermilab, collaboration members coming from these experiments can begin submitting jobs quickly, as they are working with a familiar system. The DUNE Production workflows were also able to leverage the existing infrastructure support teams in place to serve other collaborations and consortia such as CMS and OSG. Finally, as GlideinWMS is widely used in HEP, setting up new sites is extremely straightforward, especially if the site already supports another experiment that uses GlideinWMS. Our integration times for new DUNE sites are typically less than one week, and successful production jobs immediately after opening up a site are now the rule rather than the exception. This ease of setup has been a key enabler of DUNE's international expansion over the past 18 months. From January to November 2019, sites outside the United States delivered approximately 49% of the total DUNE Production wall hours, as shown in Figure 2. International sites regularly run the full suite of DUNE jobs, including ProtoDUNE data reconstruction.

Future workflow management software evaluation
The existing POMS+GlideinWMS infrastructure should be adequate to meet DUNE's needs through the second ProtoDUNE run planned for 2022. As DUNE constructs the Near and Far Detectors, it is clear that the computing challenges will be much greater than they currently are, and the current infrastructure will not suffice. The main needs are: the ability to quickly provision resources on a variety of architectures, including traditional High Throughput Computing (HTC), HPC, and GPU; the ability to schedule large blocks of jobs on HPC machines when the workload demands it or when such resources are foreseen to become available soon; and support for so-called "pipeline" workflows, which are multi-stage workflows where each stage may require a different type of resource. A provisioning system should be aware of these needs and begin provisioning the resources required for stage N+1 while the jobs of stage N are running, so that the resources for the next stage are available immediately after the conclusion of the previous stage.
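The pipeline provisioning pattern described above can be sketched in a few lines of Python. This is an illustrative toy, not DUNE or GlideinWMS code: the stage names, resource types, and the `provision` and `run_stage` helpers are invented for the example. The key idea is that the request for stage N+1's resources is issued concurrently, while stage N's jobs run:

```python
# Toy sketch of overlapped provisioning in a multi-stage pipeline
# workflow (hypothetical stages and helpers, for illustration only).
from concurrent.futures import ThreadPoolExecutor

STAGES = [
    ("generation", "HTC"),       # stage name, required resource type
    ("reconstruction", "GPU"),
    ("analysis", "HPC"),
]

def provision(stage, resource):
    # Stand-in for requesting slots of the given resource type.
    return f"{resource} slots ready for {stage}"

def run_stage(stage, slots, log):
    # Stand-in for running one stage's jobs on provisioned slots.
    log.append(f"ran {stage} on {slots}")

def run_pipeline(stages):
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Provision the first stage up front.
        future = pool.submit(provision, *stages[0])
        for i, (stage, _) in enumerate(stages):
            slots = future.result()  # wait for stage i's resources
            if i + 1 < len(stages):
                # Begin provisioning stage i+1 while stage i runs.
                future = pool.submit(provision, *stages[i + 1])
            run_stage(stage, slots, log)
    return log
```

In a real workflow management system the provisioning call would be a request to a factory or batch scheduler and could take minutes to hours, which is precisely why overlapping it with the running stage matters.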
The POMS system will continue to evolve to meet DUNE's requirements, and DUNE will consider continuing to use it as it is upgraded. DUNE is also considering two other existing workflow management systems: PanDA [9], currently used by the ATLAS Collaboration, and DIRAC [10], currently used by LHCb and several other medium-scale experiments. At the moment DUNE has not finalized its formal requirements for functionality in a future system. It is clear, however, that the requirements should be driven by the full DUNE computing model, which takes into account limitations such as total experiment storage and network bandwidth, and not by attempts to fit into any specific existing setup.

Summary
The DUNE Production environment builds on successful job submission and workflow management systems also used by other experiments. The Production Group is responsible for large-scale data reconstruction and simulation generation, and plays major roles in computing site commissioning and workflow management software evolution. Several workflow management systems are under consideration by DUNE for long-term adoption, and the evaluation process will be based on the requirements of the full DUNE computing model.