Computing in the KM3NeT Research Infrastructure

The KM3NeT research infrastructure is currently being established. It will consist of a network of deep-sea neutrino detectors to answer fundamental questions both in particle physics and astrophysics. The complexity and volume of the generated datasets is a challenge for analysis and archival of the data, and requires significant computing resources. The KM3NeT collaboration adopts a tier-like computing model for data management. The KM3NeT data management plan and computing model is presented, and plans for public data with open access are discussed.
∗e-mail: jannik.hofestaedt@fau.de © The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/). EPJ Web of Conferences 207, 08001 (2019) https://doi.org/10.1051/epjconf/201920708001


The KM3NeT research infrastructure
KM3NeT is a large research infrastructure that will consist of a network of deep-sea neutrino detectors in the Mediterranean Sea with user ports for Earth and Sea sciences. The neutrino detectors share the same technology while targeting different neutrino energy regimes and physics goals. ARCA (Astroparticle Research with Cosmics in the Abyss) is a sparse gigaton-scale detector optimised for high-energy neutrino astronomy, and ORCA (Oscillation Research with Cosmics in the Abyss) is a dense megaton-scale detector optimised for measuring the oscillation of atmospheric neutrinos. Both detectors are essentially three-dimensional arrays of thousands of optical sensors that detect Cherenkov light from charged particles originating from collisions of neutrinos with atomic nuclei. The facility will also house instrumentation for Earth and Sea sciences for long-term and online monitoring of the deep-sea environment [1,2]. Further details on the main science objectives and a description of the technology are presented in the KM3NeT 2.0 Letter of Intent [2] and in other contributions to this conference.
The KM3NeT research infrastructure is being constructed in a phased and distributed approach. Each building block (BB) comprises logical and technical sub-units of the respective detectors, and is sufficiently large that partitioning is possible without penalty in the physics outcome. The current construction sites are east of Sicily (KM3NeT-It) for ARCA and south of Toulon (KM3NeT-Fr) for ORCA. Both detectors are currently under construction and will together consist of three BBs. A future phase with additional BBs is under consideration but not yet in the planning phase.
This document is mainly based on the "KM3NeT Data Management Plan" [3] and the "KM3NeT report on conceptional design of open data generation, archiving, test programs and access" [4], both of which can be found at [1].

Data volumes
Each BB consists of about 64 thousand photomultiplier tubes (PMTs). The count rate per PMT is about 5-10 kHz, and is dominated by optical background noise. All PMT signals above a certain threshold are transmitted to shore and processed there by a computer farm ("all-data-to-shore"). The resulting data volume per BB amounts to about 5 GB/s.
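As a rough cross-check of these figures, the sketch below derives the aggregate hit rate of one BB and the average data size per PMT hit that the quoted throughput would imply. The bytes-per-hit value is an inference from the round numbers above, not a KM3NeT specification, and all names in the code are illustrative.

```python
# Back-of-envelope check of the raw data rate for one building block (BB).
# Inputs are the round numbers quoted in the text; everything derived from
# them is an estimate, not an official KM3NeT figure.

N_PMTS = 64_000                  # PMTs per BB ("about 64 thousand")
RATE_PER_PMT_HZ = (5e3, 10e3)    # count rate per PMT, 5-10 kHz
DATA_RATE_BYTES_PER_S = 5e9      # "about 5 GB/s" sent to shore (1 GB = 1e9 B)


def aggregate_hit_rate(n_pmts, rate_per_pmt_hz):
    """Total number of PMT hits per second in one BB."""
    return n_pmts * rate_per_pmt_hz


def implied_bytes_per_hit(data_rate, hit_rate):
    """Average data size per hit implied by the quoted throughput."""
    return data_rate / hit_rate


low_hits = aggregate_hit_rate(N_PMTS, RATE_PER_PMT_HZ[0])
high_hits = aggregate_hit_rate(N_PMTS, RATE_PER_PMT_HZ[1])

print(f"hit rate: {low_hits:.2e} to {high_hits:.2e} hits/s")
print("implied size per hit: "
      f"{implied_bytes_per_hit(DATA_RATE_BYTES_PER_S, high_hits):.1f} to "
      f"{implied_bytes_per_hit(DATA_RATE_BYTES_PER_S, low_hits):.1f} bytes")
```

The result is only an order-of-magnitude consistency check: it shows that a throughput of a few GB/s follows naturally from hundreds of millions of hits per second with a compact per-hit record.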

Computing requirements
The expected data volumes and required computing resources for one BB are summarised in Table 1, separated into the following categories: raw data, experimental data processing and simulated data processing. The second and third columns give the storage size and CPU time consumed per processing pass of the data. The fourth to sixth columns estimate the yearly computing requirements, assuming a reasonable number of processing passes per year. In terms of storage size, the largest consumers are the intermediate steps in which calibration data is added to the raw data to obtain the reconstructed data. The raw data itself and the simulation output also contribute significantly. In terms of CPU time, the largest consumer is the event reconstruction. The calibration is relatively fast, so it is planned not to store the intermediate data processing steps; the quoted total storage size therefore excludes them. The totals per BB and year are about 1000 TB of storage and about 350M CPU hours in the HEPSPEC-06 benchmark metric. Note that these estimates are subject to change, as they are based on the current status of the data processing software.
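To put the yearly totals into perspective, the following sketch converts them into a sustained computing capacity and scales them to the three BBs under construction. It reads the quoted CPU figure as HEPSPEC-06 hours, and it assumes linear scaling with the number of BBs and an illustrative benchmark score of 10 per core; these are assumptions made here and do not come from the KM3NeT documents.

```python
# Rough conversion of the yearly totals quoted in the text (per BB).
STORAGE_TB_PER_BB_YEAR = 1000        # "about 1000 TB per BB and year"
CPU_HS06_HOURS_PER_BB_YEAR = 350e6   # "about 350M CPU hours", read as HS06 hours
N_BUILDING_BLOCKS = 3                # ARCA and ORCA together (from the text)

HOURS_PER_YEAR = 365 * 24            # 8760 h, ignoring leap years
HS06_PER_CORE = 10.0                 # ASSUMPTION: illustrative per-core score


def sustained_hs06(hs06_hours_per_year):
    """Average HS06 capacity needed if the work is spread evenly over a year."""
    return hs06_hours_per_year / HOURS_PER_YEAR


def scale_linearly(per_bb_value, n_bb):
    """ASSUMPTION: requirements scale linearly with the number of BBs."""
    return per_bb_value * n_bb


per_bb_hs06 = sustained_hs06(CPU_HS06_HOURS_PER_BB_YEAR)

print(f"sustained capacity per BB: {per_bb_hs06:,.0f} HS06")
print(f"equivalent cores per BB:   {per_bb_hs06 / HS06_PER_CORE:,.0f}")
print(f"storage for {N_BUILDING_BLOCKS} BBs: "
      f"{scale_linearly(STORAGE_TB_PER_BB_YEAR, N_BUILDING_BLOCKS)} TB/year")
```

Because the per-core score varies between hardware generations, the core count should be read as indicative only.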

KM3NeT computing model
Due to the large data volumes and significant computing requirements (see Section 2.2), a computing model for the distribution and processing of the KM3NeT data has been established. It is based on the LHC computing model [5] and has a hierarchical tier-like structure.

Tier    Computing facility        Processing steps
Tier-0  at detector sites         triggering, quasi-online calibration and reconstruction
Tier-1  computing centres         calibration, reconstruction, simulation
Tier-2  local computing clusters  development, simulation, analysis

The general scheme of the KM3NeT computing model is summarised in Table 2. A more schematical representation of the data management is shown in Figure 1.
At the lowest layer (Tier-0), computing farms filter the raw data in real time and select interesting events. This process is commonly known as event triggering. The computing farms are located in the shore station of each detector site. The main requirement is the reduction of the overall data rate (from 5 GB/s to about 5 MB/s for one BB). The output data is distributed to computing centres (CC-IN2P3 and CNAF). For detector monitoring, specific monitoring data is recorded. For alert handling, selected events are reconstructed using fast algorithms with a low latency of a few seconds to minutes ("quasi-online reconstruction").
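The size of this reduction can be made concrete with a short calculation. The sketch below computes the reduction factor and the yearly volume of filtered data for one BB, assuming uninterrupted data taking over a full year; that duty cycle is an assumption of this example, and the function names are illustrative.

```python
# Data reduction at Tier-0 for one BB, using the rates quoted in the text.
RAW_RATE_BYTES_PER_S = 5e9           # about 5 GB/s arriving on shore
FILTERED_RATE_BYTES_PER_S = 5e6      # about 5 MB/s after triggering
SECONDS_PER_YEAR = 365 * 24 * 3600   # ASSUMPTION: continuous data taking


def reduction_factor(raw_rate, filtered_rate):
    """Factor by which the trigger reduces the data rate."""
    return raw_rate / filtered_rate


def yearly_volume_tb(rate_bytes_per_s, seconds=SECONDS_PER_YEAR):
    """Data volume accumulated in one year, in TB (1 TB = 1e12 bytes)."""
    return rate_bytes_per_s * seconds / 1e12


factor = reduction_factor(RAW_RATE_BYTES_PER_S, FILTERED_RATE_BYTES_PER_S)
filtered_tb = yearly_volume_tb(FILTERED_RATE_BYTES_PER_S)
unfiltered_tb = yearly_volume_tb(RAW_RATE_BYTES_PER_S)

print(f"reduction factor: {factor:.0f}")
print(f"filtered data per BB and year:   {filtered_tb:.1f} TB")
print(f"unfiltered data per BB and year: {unfiltered_tb:.0f} TB")
```

Under these assumptions the filtered stream amounts to well below the total yearly storage quoted per BB, whereas storing the unfiltered stream would require several orders of magnitude more.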
In the next layer (Tier-1), the filtered data is processed. Various dedicated event reconstruction algorithms are applied. For optimal reconstruction resolution, a detector calibration is applied, using calibration data to determine the time offsets, positions and efficiencies of the PMTs. The final reconstruction results are stored for further analysis. Dedicated simulations are also processed in this layer, allowing for a direct comparison of experimental data with simulation expectations.
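The role of the calibration step can be illustrated with a minimal sketch in which each hit is combined with the calibration record of its PMT. The data classes, field names and the sign convention for the time offset are hypothetical choices made for this example; they do not reflect the actual KM3NeT software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PMTCalibration:
    """Hypothetical per-PMT calibration record (field names are illustrative)."""
    t0_ns: float        # time offset of the PMT
    position_m: tuple   # (x, y, z) position of the PMT
    efficiency: float   # relative detection efficiency


@dataclass(frozen=True)
class Hit:
    """Uncalibrated hit as it might come out of the filtered data stream."""
    pmt_id: int
    time_ns: float


@dataclass(frozen=True)
class CalibratedHit:
    pmt_id: int
    time_ns: float
    position_m: tuple
    efficiency: float


def calibrate(hit, calibration_table):
    """Attach position and efficiency to a hit and correct its time."""
    cal = calibration_table[hit.pmt_id]
    # ASSUMPTION: the offset is additive and is subtracted from the raw time.
    return CalibratedHit(hit.pmt_id, hit.time_ns - cal.t0_ns,
                         cal.position_m, cal.efficiency)


# Example with made-up numbers for a single PMT.
table = {7: PMTCalibration(t0_ns=2.5, position_m=(0.0, 0.0, 100.0),
                           efficiency=0.95)}
calibrated = calibrate(Hit(pmt_id=7, time_ns=1000.0), table)
print(calibrated)
```

Keeping the calibration in a separate lookup table, rather than storing calibrated hits, mirrors the plan described in the text of not saving intermediate processing steps because the calibration is fast to reapply.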
The last layer (Tier-2) is formed by the local computing clusters of the participating institutes, where more user-specific applications are run, such as physics analyses of the reconstructed and simulated data. Data transfer between the computing centres is based on GRID access tools (e.g. xrootd, iRODS, and gridFTP).
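The assignment of processing steps to tiers in Table 2 can also be written down as a small lookup structure, which makes it easy to ask where a given step runs. The dictionary layout and the function name are illustrative only; the tier names, facilities and steps are taken from the table.

```python
# Tier model from Table 2 expressed as a plain mapping.
TIERS = {
    "Tier-0": {
        "facility": "at detector sites",
        "steps": ("triggering", "quasi-online calibration",
                  "quasi-online reconstruction"),
    },
    "Tier-1": {
        "facility": "computing centres",
        "steps": ("calibration", "reconstruction", "simulation"),
    },
    "Tier-2": {
        "facility": "local computing clusters",
        "steps": ("development", "simulation", "analysis"),
    },
}


def tiers_for(step):
    """Return the tiers that run the given processing step, lowest tier first."""
    return [name for name, info in TIERS.items() if step in info["steps"]]


for step in ("triggering", "simulation", "analysis"):
    print(f"{step}: {', '.join(tiers_for(step))}")
```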

Data management plan
The data management plan is a fundamental part of the KM3NeT computing model (see Section 2.3) and follows the 'Guidelines on FAIR Data Management in Horizon 2020' [6] by making the data findable, accessible, interoperable and reusable (FAIR). The FAIR concept also applies to the public data that will be released (see Section 2.5).
The KM3NeT collaboration has developed measures to ensure the reproducibility and usability of all scientific results over the full lifetime of the project and for an additional ten years after shutdown. Low-level data and high-level data will be stored in parallel at the two main storage centres (CC-IN2P3 and CNAF). A specialised group of experts from the collaboration will process the data from low-level to high-level and will provide data-related services.
Internally, the primary data are stored in the ROOT [7] format. For high-level analyses and for the open data generation, high-level data will be converted to other standard formats (e.g. the HDF5 [8] and FITS [9] formats, and standards defined by the ASTERICS project [10]). Metadata is provided in a customisable format, such as ASCII text.

Open-access data policy
High-level data (neutrino data, atmospheric muon data, environment monitoring data) will be distributed with open access via a web-based interface. The implementation is not yet defined. One possibility is to include the data in a Virtual Observatory. All metadata necessary to uniquely identify data sets will be provided. Simulation data will also be provided for comparisons with expectations, to allow an optimal use of the open-access data. Together with the data, the software to read and analyse the data will be provided, thus creating an open data access platform, ideally within the EOSC [11].
The data is made publicly available after an embargo time, the duration of which is not yet defined. The embargo time is granted to the KM3NeT collaboration in return for hosting and operating the facility, and allows the collaboration to process and exploit the generated data first. It is also needed to produce high-quality data for public dissemination.