Final Analysis Sample Metadata

IceCube is a cubic-kilometer neutrino detector located at the South Pole. Data are processed and filtered in a data center at the site. After transfer to a data warehouse in the northern hemisphere, data are further refined through multiple levels of selection and reconstruction to generate analysis samples. So far, the production and curation of these analysis samples have been handled in an ad hoc way in IceCube. New tools have been developed to capture and validate the metadata associated with these data samples in consistent and machine-readable specifications. Development was driven by analysis use cases and by the pursuit of reproducible scientific results.


Introduction
The IceCube detector [1] is located at the geographic South Pole and was completed at the end of 2010. It consists of 5160 optical sensors buried between 1450 and 2450 meters below the surface of the South Pole ice sheet and is designed to detect interactions of neutrinos of astrophysical origin.
Data taken from the detector are stored in a data warehouse at the University of Wisconsin-Madison. Scientific Working Groups perform analyses and create derivative data products (e.g. Level 2, Level 3, Level 4) that become analysis samples. Some of these analysis samples contain the data used to publish results in scientific papers.
The tracking of analysis samples within working groups has been an ad hoc process within IceCube. At times, it has been difficult to track down the data and code that were used to generate a particular result. This is not conducive to reproducibility, a key requirement for results in every scientific discipline.
We created a Final Analysis Sample registration service to provide a central, machine-readable record of the analysis samples created by working groups. Our goal was to ensure that every result is reproducible by identifying the data and code that were used to generate it.

Methods
We built a web form where physicists can register their analysis data. The form collects a set of sample-level metadata fields, along with a set of metadata fields for each subsample. Physicists fill in the form with the appropriate metadata and then click a large button labeled Register at the bottom of the form. After validating the input, the registration service begins to register the provided analysis sample.

Subsample Registration
The service ensures that every file specified in the subsample is registered with a File Catalog service in IceCube. After all of the files are registered, the input files are grouped into collections, which are sets of files associated by a query. The service then registers a snapshot of each collection with the File Catalog, so that the exact set of input files can always be reproduced.
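The difference between a collection and a snapshot can be illustrated with a minimal sketch. The FileCatalog class below is a toy in-memory stand-in written for this illustration only; its method names, the subsample key and the file paths are assumptions, not the interface of the IceCube File Catalog.

```python
from dataclasses import dataclass, field


@dataclass
class FileCatalog:
    """Toy in-memory catalog (hypothetical, not the IceCube File Catalog API)."""
    files: dict = field(default_factory=dict)      # path -> metadata dict
    snapshots: list = field(default_factory=list)  # frozen tuples of paths

    def register_file(self, path, **metadata):
        # Registering the same path twice leaves a single record.
        self.files.setdefault(path, {}).update(metadata)

    def collection(self, **query):
        # A collection is a query: it is re-evaluated on every call,
        # so it grows when new matching files are registered.
        return sorted(
            path for path, meta in self.files.items()
            if all(meta.get(key) == value for key, value in query.items())
        )

    def snapshot(self, **query):
        # A snapshot freezes the current result of the query.
        snap = tuple(self.collection(**query))
        self.snapshots.append(snap)
        return snap


catalog = FileCatalog()
catalog.register_file("/data/example/a.i3", subsample="baseline")
catalog.register_file("/data/example/b.i3", subsample="baseline")
frozen = catalog.snapshot(subsample="baseline")

# A file registered later changes the collection but not the snapshot.
catalog.register_file("/data/example/c.i3", subsample="baseline")
live = catalog.collection(subsample="baseline")
```

In this sketch, live contains three paths while frozen still contains the original two, which is the property that makes a snapshot suitable as a permanent record of an input file set.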
Parent data files represent input to the analysis. These can be experimental or simulation data. It is expected that all such files will already exist in the data warehouse at UW-Madison. The physicist provides a text file containing the data warehouse paths of all files in the input set. These are then registered as collections and snapshots as described above.
Child data files represent the output of the analysis. It is assumed that the physicist ran the analysis code on the parent data to produce the child data. The child data is thus assumed to be in a working directory, and not yet registered in the data warehouse. The registration service copies the child data files to the canonical location for analysis samples in the data warehouse. These are then registered as collections and snapshots as described above.
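The copy step for child data could look roughly like the following sketch. The function name copy_child_files, the refusal to overwrite existing files, and all paths are assumptions made for illustration; the demonstration runs in a temporary directory rather than in a real data warehouse.

```python
import shutil
import tempfile
from pathlib import Path


def copy_child_files(child_paths, canonical_dir):
    """Copy output files into canonical_dir and return the new paths."""
    canonical_dir = Path(canonical_dir)
    canonical_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, child_paths):
        dest = canonical_dir / src.name
        if dest.exists():
            # Assumed policy: never silently overwrite a registered file.
            raise FileExistsError(f"{dest} already exists")
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return copied


# Demonstration in a temporary directory, so no real paths are touched.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    work.mkdir()
    (work / "events.npy").write_bytes(b"placeholder")
    target = Path(tmp) / "canonical" / "sim" / "baseline"
    new_paths = copy_child_files([work / "events.npy"], target)
    copied_names = [p.name for p in new_paths]
    copied_ok = all(p.is_file() for p in new_paths)
```

Refusing to overwrite is one possible design choice: once a sample is registered, its files should not change underneath the snapshot that refers to them.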

Metadata Generation
After registering an analysis sample, the child data is stored in the canonical location for analysis samples in the data warehouse. The service also places a metadata file called sample.json in the canonical directory.
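As a rough illustration of this step, the sketch below writes a sample.json file into a directory. Only the file name sample.json comes from the service described here; the function write_sample_metadata and every field in the example record are hypothetical placeholders, not the actual schema.

```python
import json
import tempfile
from pathlib import Path


def write_sample_metadata(canonical_dir, metadata):
    """Write metadata as sample.json inside canonical_dir and return its path."""
    target = Path(canonical_dir) / "sample.json"
    # sort_keys and indent give stable, diff-friendly output.
    target.write_text(json.dumps(metadata, indent=2, sort_keys=True) + "\n")
    return target


# Hypothetical record: these field names are placeholders, not the real schema.
example = {
    "name": "example-sample",
    "subsamples": [
        {
            "name": "baseline",
            "parent_snapshot": "snapshot-id-1",
            "child_snapshot": "snapshot-id-2",
        },
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = write_sample_metadata(tmp, example)
    filename = path.name
    round_trip = json.loads(path.read_text())
```

Reading the file back yields the same record that was written, so downstream tools can treat sample.json as the machine-readable description of the sample.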
In an example directory structure, the top-level directories real and sim separate experimental and simulation data. Under the sim directory, subdirectories identify the different systematics. The example contains a baseline simulation, along with a simulation that uses a different ice model.

Discussion
The Final Analysis Sample registration service is still under development. The current version is implemented in 1000 lines of CoffeeScript [2]. Input from physicists was solicited during the development of the form. Further rounds of use and review are expected in the near future.