A computational EXFOR database

The EXFOR library is a useful resource for many people in the field of nuclear physics. In particular, the experimental data in the EXFOR library serve as a starting point for nuclear data evaluations. There is an ongoing discussion about how to make evaluations more transparent and reproducible. One important ingredient may be convenient programmatic access to the data in the EXFOR library from high-level languages. To this end, the complete EXFOR library can be converted to a MongoDB database. This database can be conveniently searched and accessed from a wide variety of programming languages, such as C++, Python, Java, Matlab, and R. This contribution provides some details about the successful conversion of the EXFOR library to a MongoDB database and shows simple usage examples to underline its merits. All code required for the conversion has been made available online as open source. In addition, a Dockerfile has been created to facilitate the installation process.


Introduction
The EXFOR library [1], a comprehensive collection of experimental reaction data, is a valuable resource for many people in the field of nuclear physics. The Nuclear Reaction Data Centers (NRDC) host online services to search for data, visualize them, and retrieve them in various formats. In particular, both the flexible search options provided by those online services and the simple structure of so-called computational formats, such as C4, are tremendously helpful in nuclear data evaluations. Notwithstanding the indisputable value of existing services and formats, there may be use cases involving reaction data from EXFOR which are not yet optimally covered by them:
1. There is an ongoing discussion in the community about how to make evaluations more transparent and reproducible, and the idea of automated evaluation pipelines is gaining momentum. Such a pipeline is a sequence of scripts in a programming language to perform an evaluation. Choices of an evaluator are implemented as instructions in scripts, thereby removing the need for manual intervention in the event that an evaluation needs to be reproduced. The creation of such scripts should be facilitated as much as possible to enable evaluators to focus on the essential evaluation work. Readability also counts: other evaluators should later be able to understand the scripts with as little effort as possible. Convenient access to the EXFOR library in just a few lines of code, which means both searching and retrieving data, from a variety of programming languages serves this goal.
2. It can be foreseen that more and more sophisticated methods from statistics and machine learning will be used to analyze the data in the EXFOR library. The variety of available algorithms and the rapid development of new ones suggest that it may be difficult to come up with a computational format that suits all potential requirements. In this application scenario, users should be put in a position to quickly create a customized computational format themselves and traverse the EXFOR library in any way they want.
* e-mail: georg.schnabel@nucleardata.com
The solution put forward in this contribution to enable user-friendly access to information in the EXFOR library is to convert the complete EXFOR library to a MongoDB database [2]. This database application and the associated database architecture are very popular in the commercial world. MongoDB can be downloaded and used free of charge. Due to its widespread adoption, MongoDB databases can be accessed from a wide variety of programming languages including C++, Python, Java, Perl, Matlab and R. Its rich query language and the possibility to access any textual and non-textual information suggest that a MongoDB database is a viable solution for the two use cases described above.
In this contribution, we first discuss the original EXFOR format [3] in comparison with the JSON format [4]. Afterwards we discuss the creation of a MongoDB database based on EXFOR data given in the JSON format. Then we provide a simple example of how the MongoDB EXFOR database can be accessed from Python. A conclusion section ends this document.

From EXFOR to JSON
The EXFOR format [3] is a data exchange format designed to enable the storage of numerical data and textual information related to experiments. Data associated with an experiment is bundled together as a so-called entry. An entry comprises several subentries. The first subentry contains general bibliographical information, such as the authors of the measurements or the facility, and specifications common to all subsequent subentries. From the second subentry onwards, each subentry contains the data associated with a specific reaction process measured in the experiment.
The EXFOR format defining the structure of an entry was conceived to be readable by humans and machines. In order to be readable by machines, it follows rigid structural rules. For instance, a line contains two or more fields and the size of a field must be a multiple of eleven characters. At most 66 character slots in a line can be used to accommodate fields. Many lines are dedicated to storing key-value pairs: the first field of size eleven specifies a keyword, such as AUTHOR, and the second field of size 55 contains the associated value, e.g., the names of the authors.
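The field layout described above can be illustrated with a minimal sketch. It assumes only what is stated in the text (fields of eleven characters, at most 66 character slots, a keyword field followed by a 55-character value); the function names and the author names in the example line are hypothetical, and real EXFOR entries contain further conventions that are not handled here.

```python
FIELD_WIDTH = 11      # width of one field in characters
MAX_FIELD_CHARS = 66  # character slots per line that can hold fields


def split_fields(line):
    """Split the field area of a line into six consecutive 11-character fields."""
    area = line[:MAX_FIELD_CHARS].ljust(MAX_FIELD_CHARS)
    return [area[i:i + FIELD_WIDTH]
            for i in range(0, MAX_FIELD_CHARS, FIELD_WIDTH)]


def parse_keyword_line(line):
    """Read a key-value line: an 11-character keyword, then a 55-character value."""
    area = line[:MAX_FIELD_CHARS].ljust(MAX_FIELD_CHARS)
    keyword = area[:FIELD_WIDTH].strip()
    value = area[FIELD_WIDTH:].strip()
    return keyword, value


# Hypothetical example line; the author names are placeholders.
example = "AUTHOR".ljust(FIELD_WIDTH) + "(A.Author,B.Author)"
print(parse_keyword_line(example))
```

Anything beyond the 66 character slots is simply ignored in this sketch, which is a simplification made for illustration.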
The rigid structure is in principle a good thing, as it makes it easier to write programs that parse the content of EXFOR entries and make it accessible in a high-level programming language. However, at the time of writing and to the knowledge of the author, no such parser is readily available for download. The notable exception may be the X4TOC4 code in the EMPIRE code system, which can be requested from the Nuclear Energy Agency [5].
Ambitious users are left with implementing a parser themselves or finding someone who has already implemented one. This situation is certainly far from optimal and a huge burden for newcomers who want to deal with EXFOR data in more flexible ways than offered by available tools.
Furthermore, even though the EXFOR format is human-readable, modifications by hand bear the risk of inadvertently violating format specifications. For instance, the BIB keyword introducing a block with bibliographical information is followed by two fields indicating the number of keywords and the number of lines in the block. The entry is easily put into an inconsistent state by adding a bibliographical field by hand without updating these counter fields accordingly. Here one may argue that users do not need to change EXFOR entries themselves. However, it can be countered that format specifications may also be motivated by the needs of users. If users have the option to create their individual EXFOR entries with customized fields pertinent to their domain of application, e.g., an alternative representation of covariance matrices, this could provide helpful input for potential extensions of the official EXFOR library in the future.
Both problems, the need to write a parser and the error-prone nature of modifications by hand, can be solved by converting the entries given in the EXFOR format to another format that is widely supported in the field of information technology. A natural candidate for that purpose is the JSON format [4]. It is a hierarchical format and can store numerical and textual data. It therefore provides all the features needed to store the information available in an EXFOR entry without any loss of information or accuracy.
An R package to convert entries or subentries given in the EXFOR format to the JSON format has been implemented and is available for download [6]. The philosophy of the converter is to preserve the logical structure of the original EXFOR entry as much as possible. Keywords in the original entry are also keywords in the JSON object.
Once an EXFOR entry or subentry is available as a JSON object, one can make use of existing functionality in most of the popular programming languages to retrieve information or manipulate the JSON object. Here is a simple example that shows how to extract information from an EXFOR entry given as a JSON object using Python:

As this example shows, having the EXFOR entries stored as JSON objects greatly facilitates access to the information. The creation of customized computational formats also becomes feasible for users without too much effort, e.g., by writing a small Python script to this end. However, it is equally important to be able to effectively search for relevant data, which is the topic of the next section.
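The kind of extraction described above might look like the following minimal sketch. The JSON text, its field names (ENTRY, BIB, AUTHOR, TITLE, DATA, EN) and all values are hypothetical placeholders chosen to resemble EXFOR keywords; the exact layout produced by the converter [6] is not specified here and may differ.

```python
import json

# Hypothetical JSON text standing in for a converted EXFOR entry.
# The layout is an assumption made for illustration only.
raw = """
{
  "ENTRY": "00000",
  "BIB": {"AUTHOR": "(A.Author,B.Author)", "TITLE": "Placeholder title"},
  "DATA": {"EN": [1.0, 2.0], "DATA": [10.5, 11.2]}
}
"""

entry = json.loads(raw)  # JSON object -> nested Python dictionary

# Keywords of the entry become dictionary keys.
authors = entry["BIB"]["AUTHOR"]

# Pair each energy with the corresponding measured value.
pairs = list(zip(entry["DATA"]["EN"], entry["DATA"]["DATA"]))

print(authors)
print(pairs)
```

Only the standard library module json is used, which illustrates why no dedicated parser is needed once the data are available in the JSON format.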

About the MongoDB database
Storing a large collection of data and searching through them is the purpose of database software. Often so-called relational databases are employed for storing data. The data is organized in several tables and the association of rows residing in different tables is established by columns containing the same information even though in different tables. For instance, in the context of EXFOR, each table could possess a column EntryID containing the entry identification numbers which uniquely identify experiments. This column links the rows of different tables together.
Using this database type to store the EXFOR library requires restructuring the information present in the collection of EXFOR entries as a collection of tables. The variability in the amount of information in different subentries makes this conversion process challenging. For instance, a specific field type can be present in one subentry but absent in another one. These challenges for the conversion can be solved. To the knowledge of the author, the EXFOR library in the form of a relational database is the backbone of the EXFOR web interface provided by the IAEA.
However, because the structure of a relational database is quite different from the hierarchical structure of an EXFOR entry and there may be large differences in the amount of information available in different entries, a relational database may not be the best option for the user in all situations. For instance, it may not be easy for users to augment their local EXFOR database with additional fields, e.g., computed from other fields. Furthermore, an elaborate system of tables to capture all the information in the EXFOR library may be complicated to navigate and understand from the user perspective. It may therefore be beneficial to find a database solution that allows the EXFOR data to be stored in a form that is very close to the hierarchical representation present in the original EXFOR entries.
Database solutions that can store the EXFOR entries in a way that preserves their original structure are available in the form of so-called document-oriented databases. This database type belongs to the class of NoSQL databases. Instead of using a collection of tables, document-oriented databases manage a collection of documents. A document includes all information related to one entity. This is in sharp contrast to relational databases, where properties belonging to one entity are usually distributed over several tables. The advantage of this database type in the context of the EXFOR library is that each EXFOR entry can be stored as a document without any need to restructure or distribute its information.
The database adopted here is MongoDB [2], which can be downloaded and used free of charge. In a MongoDB database, documents are organized in collections. Documents are stored in the BSON format, which stands for binary JSON. Technical details aside, the BSON and JSON formats are equivalent. For this reason, it is possible to directly insert EXFOR entries given in the JSON format into a collection of a MongoDB database. A script to convert EXFOR entries to JSON objects and insert them into a MongoDB database has been made available at [6].
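The insertion step can be sketched in Python as follows. This is not the script provided at [6]; it is a minimal illustration under the assumption that the JSON texts are already available as strings. The function names are hypothetical, and the collection argument is expected to behave like a pymongo Collection, whose insert_many method accepts a list of dictionaries.

```python
import json


def load_documents(json_texts):
    """Parse JSON strings into dictionaries ready for insertion."""
    return [json.loads(text) for text in json_texts]


def insert_documents(collection, json_texts):
    """Insert the parsed documents and return how many were handed over.

    `collection` is assumed to offer insert_many(list_of_dicts),
    as a pymongo Collection does.
    """
    docs = load_documents(json_texts)
    if docs:  # insert_many does not accept an empty list
        collection.insert_many(docs)
    return len(docs)


# Possible usage with a running MongoDB server (database and collection
# names are taken from the connection example in this paper):
#   from pymongo import MongoClient
#   entries = MongoClient("localhost", 27017)["exfor"]["entries"]
#   insert_documents(entries, ['{"ENTRY": "00000"}'])
```

Because the documents are plain dictionaries, no schema has to be declared beforehand, which reflects the document-oriented design discussed above.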
Once the MongoDB database is filled, the expressive query language provided by the MongoDB database software can be used to search for relevant information. Simple usage examples showing how EXFOR data can be queried from Python will be the topic of the next section. The remainder of this section elaborates on some choices made for the conversion from the original EXFOR library to a MongoDB database as it is provided at [7]. The following list describes the modifications made, which are in the opinion of the author reasonable and helpful:
1. The logical units most people operate with are not entries but subentries. For this reason, the documents in the MongoDB database are subentries.
2. The first subentry of an entry contains information which is common to all subsequent subentries. For this reason, the information of the first subentry has been merged into all subsequent subentries of the same entry before they were added to the MongoDB database. Sometimes collisions of field names occur during the merging process. In such cases, the offending field names in the first subentry are altered by adding the suffix _firstSub to them prior to merging.
3. The COMMON section of an original EXFOR subentry contains quantities that are constant for all measured data points. For instance, an angle-differential cross section may have been measured at various angles for 15 MeV incident neutrons. The neutron energy is then often stored in a COMMON field to avoid redundancy. From the user point of view, it may still be helpful to have the incident energy stored in the DATA table together with the angles and the measured cross section values. Information in the COMMON section is therefore merged into the DATA table but nevertheless also preserved as a separate field.
4. Standardized units help to avoid conversion errors. Therefore all energies have been converted to MeV and all cross sections to millibarn. Compound units, such as those associated with angle-differential cross sections and spectra, are modified accordingly.
Some other modifications of less significance have not been mentioned here for the sake of brevity. They are documented in the manual at [7] accompanying the installation files for the MongoDB EXFOR database.
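The merging and unit conventions listed above can be illustrated with a short sketch. It is not the conversion code behind [7]; the dictionary layout (a DATA table stored as a mapping from column names to lists, a COMMON section as a mapping from names to single values), the function names and the unit codes in the conversion table are assumptions made for illustration.

```python
def merge_first_subentry(first, subentry, suffix="_firstSub"):
    """Merge fields of the first subentry into a later subentry.

    Field names that collide are taken from the first subentry
    under a new name carrying the suffix.
    """
    merged = dict(subentry)
    for name, value in first.items():
        target = name + suffix if name in subentry else name
        merged[target] = value
    return merged


def expand_common(subentry):
    """Copy COMMON quantities into the DATA table as constant columns.

    The COMMON field itself is preserved unchanged.
    """
    data = {col: list(vals) for col, vals in subentry.get("DATA", {}).items()}
    n_rows = len(next(iter(data.values()), []))
    for name, value in subentry.get("COMMON", {}).items():
        data.setdefault(name, [value] * n_rows)
    result = dict(subentry)
    result["DATA"] = data
    return result


# Assumed unit codes and their conversion factors to MeV.
MEV_PER_UNIT = {"EV": 1e-6, "KEV": 1e-3, "MEV": 1.0}


def to_mev(value, unit):
    """Convert an energy given in one of the assumed units to MeV."""
    return value * MEV_PER_UNIT[unit]


first = {"BIB": {"AUTHOR": "(A.Author)"}, "FACILITY": "placeholder"}
sub = {"BIB": {"REACTION": "placeholder"},
       "COMMON": {"EN": 15.0},
       "DATA": {"ANG": [30.0, 60.0], "DATA": [1.2, 0.8]}}

doc = expand_common(merge_first_subentry(first, sub))
print(sorted(doc))
print(doc["DATA"]["EN"])
```

In this example the colliding BIB field of the first subentry ends up as BIB_firstSub, and the incident energy from COMMON appears as a constant column next to the angles.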

Simple usage example
According to the TIOBE index [8], Python is one of the most popular programming languages, and it is also widely adopted in the field of nuclear physics. Therefore some simple examples of how to retrieve data from the MongoDB EXFOR database are provided here to demonstrate the ease of use. The following examples rely on the pymongo module.
To interact with the database, one first needs to connect to it:

    from pymongo import MongoClient
    # connect to the MongoDB server running on the local machine
    client = MongoClient('localhost', 27017)
    # select the database and the collection holding the subentries
    db = client["exfor"]
    entries = db["entries"]

One of the most elementary user actions is to retrieve a subentry using its subentry identification number (an eight-digit string). The expression passed as an argument to the function find_one specifies the search query. Search queries for a MongoDB database have to be formulated as JSON objects. Since a nested dictionary in Python is essentially equivalent to a JSON object, Python programmers can probably get used to this query syntax quickly.
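A retrieval by identification number could be sketched as follows. The field name "ID" holding the subentry identification number is an assumption, as are the function names; the actual field name depends on how the database at [7] was built. The find_one method itself is part of pymongo and takes the query as a dictionary.

```python
def subentry_query(subentry_id, id_field="ID"):
    """Build the query dictionary for one subentry.

    `id_field` is a hypothetical field name; adapt it to the database.
    """
    if len(subentry_id) != 8:
        raise ValueError("a subentry identification number has eight characters")
    return {id_field: subentry_id}


def fetch_subentry(collection, subentry_id, id_field="ID"):
    """Return the matching document as a dictionary, or None if absent.

    `collection` is assumed to behave like a pymongo Collection.
    """
    return collection.find_one(subentry_query(subentry_id, id_field))


# Possible usage with the `entries` collection from the connection example
# (the identification number is a placeholder):
#   subent = fetch_subentry(entries, "00000002")
#   if subent is not None:
#       print(subent.keys())

print(subentry_query("00000002"))
```

Keeping the query construction in a separate function makes the equivalence between the Python dictionary and the JSON query object explicit.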
The result of the query is a (potentially nested) dictionary. It can be explored by making use of the Python functions provided for dictionaries. Just as an example:

    import re
    regex = re.compile(

The variable subents is an iterator, which can be used to iterate over the found subentries in a loop, e.g.:

    for subent in subents:
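A search with a regular expression followed by a loop over the results might be sketched as below. The field names and the pattern are hypothetical, since the query used in the original listing is not known. What can be relied on is that pymongo accepts compiled Python regular expressions as query values and that find returns a cursor, which can be iterated like any other iterator.

```python
import re


def find_matching(collection, field, pattern):
    """Return an iterator over documents whose `field` matches `pattern`.

    `collection` is assumed to behave like a pymongo Collection;
    `field` is a hypothetical field name of the stored subentries.
    """
    regex = re.compile(pattern)
    return collection.find({field: regex})


def collect_ids(subents, id_field="ID"):
    """Loop over the found subentries and gather their identification numbers."""
    ids = []
    for subent in subents:
        ids.append(subent[id_field])
    return ids


# Possible usage with the `entries` collection from the connection example
# (field name and pattern are placeholders):
#   subents = find_matching(entries, "REACTION", r"N,TOT")
#   print(collect_ids(subents))
```

Nested fields can be addressed in MongoDB queries with dot notation, e.g. a key of the form "outer.inner", so the same helper would also work for fields deeper in the document hierarchy.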

Conclusions
We argued that the JSON format is a suitable format to store all the information available in the EXFOR library. Entries and subentries in the EXFOR library can be converted to the JSON format without loss of information or accuracy. Due to the wide support for the JSON format, the extraction of EXFOR data from JSON objects is trivial in high-level languages, such as Python. An EXFOR-to-JSON converter package has been made available at [6].
We also argued that a relational database may not always be the ideal solution to store the data of the EXFOR library from the perspective of the user. Document-oriented databases, such as MongoDB, enable storing the EXFOR library without structural transformations. As JSON objects can be imported into a MongoDB database and an EXFOR-to-JSON converter is in place, the conversion of the complete EXFOR library into a MongoDB database is straightforward. A script that automates this conversion process is available at [9].
The complete installation process of the MongoDB EXFOR database requires several steps, such as the installation of the MongoDB database and the conversion of EXFOR entries. Therefore, to facilitate the installation for the user, the installation process has been automated to a large extent using the Docker technology for virtualization. The required files and installation instructions to install the EXFOR MongoDB on the local computer can be found at [7].
It is hoped that the provided computational database will be helpful for users in its current form. Future use cases will certainly point to possible improvements, and issues may surface. Because users can modify the database themselves without too much effort, they can become designers and, for instance, create their own computational formats and databases. This circumstance may foster the development of tools and formats related to nuclear data and potentially provide inspiration for how to improve the official EXFOR library.