WPEC SG50: Developing an Automatically Readable, Comprehensive and Curated Experimental Nuclear Reaction Database

Abstract. The Organisation for Economic Co-operation and Development (OECD) Nuclear Energy Agency (NEA) Working Party on International Nuclear Data Evaluation Co-operation (WPEC) subgroup (SG) 50 was formed in 2020 to develop an automatically readable, comprehensive and curated experimental nuclear reaction database. This database is called MEDUSAL (Machine-readable Experimental Data User Application & Library), and will draw from EXFOR. The EXFOR database preserves experimental nuclear reaction data true to its original documentation and information from the authors of the data. MEDUSAL will deviate from EXFOR by storing additional information from users of the data for their fields of work (evaluation, model development, validation, etc.). This includes expert judgment on the data sets, identification of data points as outliers, renormalization of the data to the newest monitor reactions, and estimations of missing uncertainty sources. The format for MEDUSAL is being developed to enable easy automatic parsing of large amounts of data. Here, we will summarize the use cases, high-level requirements and first steps towards developing the database MEDUSAL and the API to access it.


Introduction
The EXFOR database [1,2] has for decades served as the starting point for nuclear reaction evaluations. Recently, increasing interest in automation and reproducibility has led to more users with varied needs. Some of these needs conflict with the philosophy of EXFOR, which is to compile and document the data sets as published and intended by the authors. These conflicts could be addressed by creating an additional database that draws from the EXFOR database but allows for further deviations from the EXFOR format and philosophy, beyond the current work to make the EXFOR data more accessible [3,4]. To compile the needs of the users and make recommendations on the layout and format of such a database, an international collaboration, Working Party on International Nuclear Data Evaluation Co-operation (WPEC) subgroup SG50, has been established through the Organisation for Economic Co-operation and Development (OECD) Nuclear Energy Agency (NEA).
WPEC SG50 aims at developing an automatically readable, comprehensive and curated experimental reaction database that was termed "MEDUSAL" (Machine-readable Experimental Data User Application & Library). The goal of SG50 is to identify the needs of these varied users, and then design a database to meet those needs. The subgroup will produce example files and prototype codes for translating and updating data set files to allow for testing and feedback on the proposed specifications. The database users identified are described in §2, and the high-level requirements compiled for these users in §3. The resulting hierarchy is given in §4. In §5, examples of the metadata requirements and specifications are given, showing how they are connected to the high-level requirements. Finally, §6 summarizes the progress on implementation of the database and creation of example files.
* e-mail: amanda.lewis@unnpp.gov
** e-mail: dneudecker@lanl.gov

Database Users
The users of MEDUSAL were identified to enable compilation of their high-level requirements:
• Nuclear data evaluators, who retrieve experimental data for a specific reaction, re-normalize them to a common monitor, add missing uncertainties, truncate or reject data as necessary, and estimate total covariances,
• Experimentalists, who use experimental data and metadata to understand how past experiments were undertaken and what quality of data was achieved,
• Model developers, who download large amounts of data across isotopes and reactions to develop new models or parameterizations,
• Machine learning and artificial intelligence users, who need data sets and metadata to be unique and easy to read automatically, to better design future experiments or to understand possible hidden biases in data related to metadata,
• Nuclear reaction library validators, who use experimental data to judge the quality of nuclear data in libraries that may have been built upon these experimental data.

High-Level Database Requirements
From the users defined above, the following high-level requirements were compiled:
1. The data in MEDUSAL should be accessible either by selecting search criteria or by downloading the entire library. Different levels of curation should be made available, and there should be an option for users to feed information into MEDUSAL,
2. The format should be easy to read automatically for large numbers of data sets. This requires that the keys for metadata should be unique and parsable, and that observables should be clearly identified,
3. Data should be curated, i.e., renormalized to current monitor reactions, missing uncertainties should be estimated by templates of expected measurement uncertainties [5][6][7][8][9], and total covariances should be provided,
4. MEDUSAL should enable storing expert judgment on data and versions as used in evaluations.
These high-level requirements act as the basis for the design of the database hierarchy, requirements, and specifications.
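High-Level Requirement 2 asks for metadata keys that are unique and parsable. As a minimal illustration (the key names, nesting, and flattening scheme below are hypothetical, not the adopted MEDUSAL format), nested metadata can be flattened into unique dotted keys that a program can match without free-text interpretation:

```python
# Sketch of High-Level Requirement 2: metadata keys that are unique and
# parsable without free-text interpretation. All names are hypothetical.
from typing import Any

def flatten_metadata(container: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested metadata into unique dotted keys, e.g. 'sample.1.thickness'."""
    flat: dict[str, Any] = {}
    for key, value in container.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_metadata(value, full_key))
        else:
            flat[full_key] = value
    return flat

metadata = {
    "observable": {"type": "cross_section", "mf": 3, "mt": 102},
    "sample": {"1": {"material": "Au-197", "thickness_mg_cm2": 25.0}},
}
flat = flatten_metadata(metadata)
assert flat["observable.mt"] == 102
assert flat["sample.1.thickness_mg_cm2"] == 25.0
```

Every key is then unique across the whole data set, so a parser can select, say, all sample thicknesses with a single pattern rather than by reading free text.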

Database Hierarchy
The design of the database is intended to best meet the High-Level Requirements explained above. Multiple "layers" will allow for all of the data treatment and storage needs, and a new definition of a single "data set" will allow for the access and format needs.

Database Layers
The database will have four "layers". The first layer, Layer 0, will be the first step in the translation from the EXFOR format into the MEDUSAL format, and will have the data converted into consistent units. Layer 1 will be the first layer accessible by default, and will contain all of the information in Layer 0, along with the metadata expected based on the parsable information. In addition, this layer can hold all information about the data set available in free text in EXFOR and in the literature. Layer 2 is the first layer to allow for a departure from the information presented by the authors. This includes renormalization to newer standards or reference data, and flags based on low uncertainties or data values discrepant from other measurements of the same observable (to meet High-Level Requirement 3). Layer 3 can hold multiple versions of the same data set, allowing for "subjective" corrections or suggestions on the data sets, including documentation of the data set exactly as it was used in an evaluation (to meet High-Level Requirement 4). Such documentation is extremely important for evaluation reproducibility.
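The four layers can be pictured as a simple progression in which each update step records its provenance. The sketch below is a hypothetical illustration: the class names, the example data set identifier, and the promotion mechanism are assumptions, not the MEDUSAL design.

```python
# Hypothetical sketch of the four-layer hierarchy described above.
from dataclasses import dataclass, field
from enum import IntEnum

class Layer(IntEnum):
    TRANSLATED = 0  # EXFOR translated, units made consistent
    ANNOTATED = 1   # + expected metadata, free-text information
    CURATED = 2     # + renormalization, discrepancy flags
    VERSIONED = 3   # + expert judgment, versions as used in evaluations

@dataclass
class DataSet:
    dataset_id: str
    layer: Layer
    history: list = field(default_factory=list)  # audit trail of updates

    def promote(self, note: str) -> "DataSet":
        """Produce the next-layer data set, keeping the lower layer intact."""
        if self.layer == Layer.VERSIONED:
            raise ValueError("already at the highest layer")
        return DataSet(self.dataset_id, Layer(self.layer + 1), self.history + [note])

ds0 = DataSet("12345.002.a", Layer.TRANSLATED)  # invented identifier
ds1 = ds0.promote("added expected metadata")
assert ds1.layer == Layer.ANNOTATED and ds0.layer == Layer.TRANSLATED
```

Returning a new object rather than mutating in place mirrors the idea that lower layers remain available unchanged while higher layers accumulate curation.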

Data sets
The definition of a single data set in MEDUSAL is focused on High-Level Requirement 2, "observables should be clearly identified". The categorization will allow for clear connection between the data sets and the observables seen in evaluated libraries (denoted by MF/MT numbers [10]) and produced by reaction modeling or fitting codes. The scope of SG50 is limited to describing the observables that do relate to MF/MT numbers, but future work can expand the categories to include other observables, such as transmission and capture yield measured in the resolved resonance region. The definition of a data set in MEDUSAL as a single observable will mean that in some cases, a single EXFOR subentry will be split into multiple MEDUSAL data sets.
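The splitting rule can be sketched as follows; the subentry structure, identifiers, and column labels are invented for illustration and do not reproduce real EXFOR output:

```python
# Sketch of the "one data set per observable" rule: an EXFOR subentry whose
# table carries two observables is split into two MEDUSAL data sets that
# share the common metadata. Structure and labels are illustrative only.
def split_subentry(subentry: dict) -> list[dict]:
    common = {k: v for k, v in subentry.items() if k != "observables"}
    return [
        {**common, "observable": obs, "data": table}
        for obs, table in subentry["observables"].items()
    ]

subentry = {
    "entry": "12345.002",  # invented identifier
    "observables": {
        "(n,g) cross section": [(0.025, 98.7)],   # (energy_eV, value_b)
        "(n,g)/(n,f) ratio": [(0.025, 0.169)],
    },
}
datasets = split_subentry(subentry)
assert len(datasets) == 2
assert all(d["entry"] == "12345.002" for d in datasets)
```

Each resulting data set then maps cleanly onto a single observable, and hence onto a single MF/MT pair or modeling-code output.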

Metadata Requirements and Specifications
As defined in High-Level Requirement 2, MEDUSAL should have a format for metadata that reduces free text and that provides unique identification of all metadata. These needs have shaped the metadata requirements and specifications described in this section. The requirements and specifications are split into several broad containers: observable, bibliography, incident particles, detectors, samples, response functions, reference data values, and correlations to other data sets. In addition, general containers to hold the information have been defined.

Requirements
The metadata requirements define the types of information that MEDUSAL should be able to store. No single data set will have all of the types of information defined in the requirements, as the database is designed to store many different kinds of experimental data. A preliminary set of metadata requirements has been created for all of the categories of information.
An example is shown with the samples container. The only requirement on the samples container is that it can hold an unlimited number of sample containers, one for each sample used in the experiment to measure the observable, monitor reaction, flux, etc.
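A minimal sketch of this requirement might look like the following; the field names and allowed uses are assumptions for illustration, not the adopted specification:

```python
# Sketch of the samples requirement: the samples container holds any
# number of sample containers, one per physical sample. Field names
# are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Sample:
    material: str
    use: str  # e.g. "observable", "monitor", "flux"

@dataclass
class SamplesContainer:
    samples: list = field(default_factory=list)

    def add(self, sample: Sample) -> None:
        self.samples.append(sample)  # no upper limit on sample count

samples = SamplesContainer()
samples.add(Sample("Pu-239", use="observable"))
samples.add(Sample("U-235", use="monitor"))
assert len(samples.samples) == 2
```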

Specifications
The requirements specify only what information can be stored in the database, not how the information should be stored. That is defined in the database specifications, which cover all requirements. For example, the specifications for the samples container define how the multiple samples are formatted, and the specifications of the sample container define the allowed attributes, which should cover all sample requirements. The attribute to meet requirement 6.1, for example, is the sample type (use): a data type to describe what the sample is used for, such as the measurement of the observable, the monitor or standard, or the flux directly.
The specifications for this use attribute employ the choice:list data type, one of the general containers defined for the database. Each attribute of type choice:list will have a defined list of allowed values. A preliminary set of specifications covering all of the categories has been created, and will continue to be refined as example data sets are created and tested.
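A choice:list check reduces to a simple membership test against the defined list; the allowed values shown here are assumptions for illustration:

```python
# Sketch of a choice:list attribute: only values from its defined list
# are accepted. The allowed values are hypothetical, not the adopted list.
ALLOWED_SAMPLE_USES = ["observable", "monitor", "flux"]

def validate_choice(value: str, allowed: list) -> str:
    """Accept the value only if it appears in the allowed list."""
    if value not in allowed:
        raise ValueError(f"{value!r} not in allowed values {allowed}")
    return value

assert validate_choice("monitor", ALLOWED_SAMPLE_USES) == "monitor"
try:
    validate_choice("background", ALLOWED_SAMPLE_USES)
except ValueError:
    pass  # rejected as expected
```

Constraining attributes this way is what keeps the metadata unique and automatically parsable, since a program never has to interpret free-text variants of the same concept.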

Implementation
Example data sets are needed for testing and feedback of the database format, hierarchy, and API. The codes used to create these example data sets can serve as prototypes of the programs that will be needed to create and access the final database.

Translation
The first step is to translate EXFOR entries into MEDUSAL data sets in Layer 0. This requires splitting up subentries into the appropriate number of data sets, determining the correct MEDUSAL observable category for each, and translating all parsable information from the EXFOR entry into the MEDUSAL data types defined in the specifications (see §5.2). Finally, this step requires converting the data values into a set of defined units, which will be consistent where possible.
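The unit-conversion part of this step can be sketched as below; the conversion tables and target units (eV and barns) are illustrative assumptions, not the defined MEDUSAL units:

```python
# Sketch of the unit-normalization step in the Layer 0 translation:
# incoming values are converted to one defined unit per quantity.
# Unit tables and target units are assumptions for illustration.
ENERGY_TO_EV = {"EV": 1.0, "KEV": 1.0e3, "MEV": 1.0e6}
XS_TO_B = {"B": 1.0, "MB": 1.0e-3, "MICRO-B": 1.0e-6}

def normalize_point(energy: float, e_unit: str, xs: float, xs_unit: str):
    """Return (energy in eV, cross section in barns)."""
    return energy * ENERGY_TO_EV[e_unit], xs * XS_TO_B[xs_unit]

e_ev, xs_b = normalize_point(14.1, "MEV", 520.0, "MB")
assert abs(e_ev - 14.1e6) < 1e-6
assert abs(xs_b - 0.52) < 1e-12
```

With every data set in the same units, downstream users can concatenate or compare large numbers of data sets without per-set bookkeeping.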

Creating Higher Levels
In the following sections, the steps to create higher-level data sets from lower-level data sets are referred to as "updating" the data sets. The code to update from Layer 0 to Layer 1 will check the Layer 0 data set against the Layer 0 schema, add in expected metadata, and check the produced Layer 1 data set against the Layer 1 schema. Any expected metadata not available in the MEDUSAL format will be clearly marked, which will indicate that the information is not necessarily unknowable, it is just not currently available in MEDUSAL data types. This updating step will create basic Layer 1 files with limited information that can be filled out by users of the database in the future. There will be analogous updating codes to go from Layer 1 to Layer 2 and Layer 2 to Layer 3, which will apply all changes that can be performed automatically, such as renormalization to updated standards and reference data.
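The Layer 0 to Layer 1 update can be sketched as follows; the schema contents and the placeholder marker standing in for the "clearly marked" missing metadata are hypothetical:

```python
# Sketch of the Layer 0 -> Layer 1 update: validate against the lower-layer
# schema, then mark expected-but-unavailable metadata explicitly rather
# than leaving it absent. Schema contents are hypothetical.
LAYER0_REQUIRED = ["observable", "data"]
LAYER1_EXPECTED = ["detector", "samples", "monitor"]

NOT_YET_AVAILABLE = "NOT-YET-IN-MEDUSAL"  # "not available", not "unknowable"

def update_to_layer1(ds: dict) -> dict:
    missing = [k for k in LAYER0_REQUIRED if k not in ds]
    if missing:
        raise ValueError(f"fails Layer 0 schema, missing {missing}")
    out = dict(ds, layer=1)
    for key in LAYER1_EXPECTED:
        out.setdefault(key, NOT_YET_AVAILABLE)  # keep existing values
    return out

ds1 = update_to_layer1({"observable": "(n,f) cross section", "data": [(1.0, 1.2)]})
assert ds1["detector"] == NOT_YET_AVAILABLE
assert ds1["layer"] == 1
```

The explicit placeholder is what later lets users of the database spot and fill in information that exists in free text or literature but not yet in MEDUSAL data types.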

Database Access
The database API should meet High-Level Requirements 1 and 4, and will allow for users to access Layers 1 through 3 by downloading large amounts of data at once or searching for very specific data sets. The API will also allow for users to feed back into the database Layers 1 and 3. In Layer 1, users will be able to provide information about the experiment that is available in free text or literature but has not yet been added to the database. In Layer 3, users will be able to provide comments on the data sets and entire versions of the data sets. These different versions of the data sets can have "objective" corrections that are not provided by the experimental authors but are supported by other measurements of the same observable, other measurements at the same facility, and/or that were used by an evaluator in an evaluation. Current work is focused on a prototype of the API to access Layer 1, which will allow for testing and feedback.
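The two access modes, filtered search and bulk download, can be sketched against a toy in-memory index; no MEDUSAL API endpoints exist yet, so all identifiers and fields below are invented:

```python
# Hypothetical sketch of the two access modes: filtered search and bulk
# download. The index, identifiers, and fields are invented illustrations.
INDEX = [
    {"id": "A1", "target": "U-235", "reaction": "(n,f)", "layer": 1},
    {"id": "A2", "target": "U-235", "reaction": "(n,g)", "layer": 1},
    {"id": "A3", "target": "Pu-239", "reaction": "(n,f)", "layer": 2},
]

def search(**criteria):
    """Return data sets matching all given key=value criteria."""
    return [d for d in INDEX if all(d.get(k) == v for k, v in criteria.items())]

def download_all():
    """Bulk access: the whole library at once."""
    return list(INDEX)

assert [d["id"] for d in search(target="U-235", reaction="(n,f)")] == ["A1"]
assert len(download_all()) == 3
```

Because every search key is a unique, parsable metadata attribute (High-Level Requirement 2), the same query mechanism serves evaluators selecting one reaction and model developers downloading data across many isotopes.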

Conclusions
Increasing interest in reproducibility and automation in nuclear reaction evaluations has led to more varied users of, and needs on, the differential experimental data in the EXFOR database. WPEC SG50 aims to compile these users and needs and design a database that will draw on EXFOR and allow for more deviations from the EXFOR philosophy and format to meet these needs. SG50 has created a list of users and high-level requirements, along with preliminary metadata requirements and specifications. Work is ongoing to create prototypes of the codes needed to create the database and the example data sets that will allow for testing and improvement of the database design.