ATLAS Open Data – Development of a simple-but-real HEP data analysis framework

The ATLAS Collaboration at the Large Hadron Collider is releasing a new set of recorded and simulated data samples at a centre-of-mass energy of 13 TeV collected in pp collisions at the LHC. This new dataset was designed after an in-depth review of the usage of the previous release of samples at 8 TeV. That review showed that capacity-building is one of the most important and abundant uses of public ATLAS samples. To fulfil the requirements of the community and at the same time attract new users and use cases, we developed real analysis software based on ROOT in two of the most popular programming languages: C++ and Python. These so-called analysis frameworks are complex enough to reproduce with reasonable accuracy the results (figures and final yields) of published ATLAS Collaboration physics papers, but still light enough to be run on commodity hardware. With the computers that university students and regular classrooms typically have, students can explore LHC data with techniques similar to those used by current ATLAS analysers. We present the development path and the final result of these analysis frameworks, their products and how they are distributed to final users inside and outside the ATLAS community.


Introduction
The main purpose of the ATLAS Open Data is to provide open access to proton-proton (pp) collision data, together with software and analysis tools, from the ATLAS experiment at the Large Hadron Collider (LHC), in accordance with the ATLAS Open Data Access policy [1], which sets out the guidelines regarding open access to ATLAS data by non-ATLAS members with a focus on education, training and outreach. The datasets and the tools are made available through the CERN and ATLAS Open Data portals [2,3]. The target audiences are high-school, undergraduate and graduate students, as well as teachers and lecturers.
The ATLAS data have been successfully deployed since 2010 in the IPPOG [4] International Masterclasses (IMC) [5], where high-school students perform various measurements based on proton-proton collisions. The ambition to bring important LHC discoveries into the classroom is realised using the discovery of the Higgs boson. In 2012 an ATLAS proton-proton collision data sample of 2 fb⁻¹ at an energy of 8 TeV was released in XML format and used to "search", among other particles, for the Higgs boson. This was followed by an 8-TeV data sample, corresponding to 1 fb⁻¹, in ROOT [6] format, accompanied by the corresponding analysis tools [7] and aimed at undergraduate physics students. The 8-TeV data and tools led to a wide range of activities [8]: hands-on particle-physics exercises; production of teaching materials, lectures and public talks; as well as the introduction of new "research-based particle physics" courses, such as in [9]. Following feedback from the users, a new release of ATLAS Open Data has been made public, this time using 13 TeV data from Run 2 of the LHC.

13 TeV ATLAS Open Data release
A new set of pp collision data at √s = 13 TeV has been released to the public for educational purposes [10]. The dataset corresponds to an integrated luminosity of 10 fb⁻¹ recorded by the ATLAS detector at the LHC in 2016. Monte Carlo (MC) simulation samples describing several Standard Model (SM) and beyond the SM (BSM) processes, which are used to model the expected distributions of different signal and background processes, are included in the release as well. As demonstrated in Figure 2, the 13 TeV ATLAS Open Data SM Higgs MC signals include more production mechanisms (VH and ttH in addition to ggF and VBF) and contain more physics objects (photons and tau leptons in addition to electrons and muons). There are more BSM MC signals in the new 13 TeV release (more than 50 samples including graviton excitations, supersymmetry and dark matter, in addition to new gauge bosons Z') than in the 8 TeV release (14 Z' samples). A full list of samples is given in Reference [10], where the references to the various generators can also be found.
The data ntuple structure has also evolved from about 50 variables in 2016 to about 90 variables in 2019 (Figure 3). Photons, tau leptons, large-radius jets, b-quark and boson tagging, as well as systematic uncertainties are available in the 13 TeV release, enabling more detailed studies. A total of about 150 GB of storage is needed for all collision and simulated 13 TeV ATLAS Open Data.

ATLAS Open Data tools
Efforts were put into the production, validation and release of the ATLAS 13 TeV data and MC simulation (in ROOT ntuple format) for training and education, including a full software and analysis framework written both in C++ and Python and interfaced to ROOT. The framework implements the necessary features to read the dataset and to make histograms and plots. Example analyses are provided, some of which demonstrate how to perform event selections, weight events appropriately, and scale MC samples to the correct cross section and luminosity. One way of achieving an operating-system-independent analysis is by means of a Virtual Machine (VM), in which all the necessary software is pre-installed. An overview of the ATLAS Open Data VM is shown in Figure 4. This VM runs a Linux-based operating system, and has ROOT and other relevant software pre-installed. Furthermore, the VM contains the software analysis code and the dataset, as well as relevant documentation. However, VMs containing datasets require a certain amount of disk space. VMs also usually require considerable graphics resources, which may slow down the running significantly. A solution to these problems is to run the VM as a server, which is possible with the new generation of VMs: their applications are simply opened in the browser of the host machine or laptop, requiring far less disk space and graphics resources than standard VMs. An illustration of this system is given in Figure 5. Jupyter notebooks use both Python and C++ ROOT kernels to analyse data and MC samples and produce physics-analysis results using a VM as a server. JavaScript applications allow simple cut-and-count analyses. GitHub and GitLab are used as repositories and GitBooks for documentation.
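As an illustration of the MC scaling step mentioned above, the following minimal sketch (plain Python, without ROOT; the function name and sample numbers are hypothetical, not taken from the actual framework) shows how a per-event scale factor can bring a simulated sample to the target luminosity:

```python
def mc_scale_factor(xsec_pb, lumi_fb, sum_of_weights):
    """Scale factor applied to each MC event so that the sample's
    normalisation matches the target integrated luminosity.

    xsec_pb        : process cross section in pb
    lumi_fb        : integrated luminosity in fb^-1 (10 fb^-1 for this release)
    sum_of_weights : total sum of generator event weights in the sample
    """
    lumi_pb = lumi_fb * 1000.0  # 1 fb^-1 = 1000 pb^-1
    return xsec_pb * lumi_pb / sum_of_weights

# Hypothetical sample: 1.0 pb cross section, 1e6 weighted events,
# scaled to the 10 fb^-1 of the 13 TeV Open Data release.
sf = mc_scale_factor(1.0, 10.0, 1.0e6)
# Each event then fills histograms with weight sf * (its generator weight).
```

In the real framework this factor is combined with per-event weights (pile-up, scale factors for lepton identification, etc.) before histogram filling.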
In addition to acquiring particle-physics knowledge, the students have the opportunity to work with modern software tools based on new technology and get an insight into principles of cloud and distributed computing.

Analysis examples
Various physics analysis examples are extensively discussed in Reference [10]. More are planned to be released soon. All will be documented and made available through notebooks as discussed in Section 3. Here we give two examples for illustration.

Search for Higgs decaying to diphotons
The first example makes use of a final state with two well-reconstructed photons with high transverse momenta pT. The distributions of the invariant mass, m, of selected diphoton events are shown in Figure 6. This is done for two unconverted central photons (pseudorapidity |η| < 0.75) (left) and inclusively for both unconverted and converted photons (right). There is good agreement between the 13 TeV data and the SM prediction featuring a Higgs boson of mass 125 GeV. With this example the students have for the first time the possibility not only to display Higgs candidates, as already done by high-school students within masterclasses, but to closely reproduce the analysis work that led to a discovery [11].
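The diphoton invariant mass underlying Figure 6 can be computed directly from the photon kinematics stored in the ntuples. A minimal sketch (plain Python; the function name is illustrative, and the photons are treated as massless, which is exact for real photons):

```python
import math

def diphoton_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two massless photons given (pT, eta, phi).

    For massless particles:
        m^2 = 2 * pT1 * pT2 * (cosh(eta1 - eta2) - cos(phi1 - phi2))
    Units of the result follow the units of pT (GeV here).
    """
    m2 = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative round-off

# Two back-to-back central photons with pT = 62.5 GeV each
# reconstruct to m = 125 GeV, the Higgs boson mass.
m = diphoton_mass(62.5, 0.0, 0.0, 62.5, 0.0, math.pi)
```

In the C++/ROOT version of the framework the same computation is typically done with Lorentz-vector classes; the closed form above is what those classes evaluate for massless inputs.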
With the advent of the ATLAS 13 TeV data, we have the ambition to tune and expand the educational material to follow the LHC 'heartbeats' and to influence the textbooks and teaching methods already at high school level.

Search for electroweak production of supersymmetric particles
After the discovery of the last missing particle of the SM, the next challenges at the LHC are: what kinds of new physics are to be found beyond the SM? Where is dark matter? Are there additional fundamental symmetries? The second example developed here deals with a search for supersymmetry, through the production of a pair of sleptons, the super-partners of the known leptons. Each slepton decays to a lepton and a neutralino, an invisible dark matter candidate particle. The final state studied here features two charged leptons (electrons or muons) and missing transverse energy. The analysis closely follows Reference [12].
The distributions of the transverse momentum of the leading lepton, the dilepton invariant mass, the missing transverse momentum E_T^miss and the "stransverse" mass m_T2 before and after the two signal-region selections are shown in Figure 7. Good agreement is found between data and MC prediction, even in signal regions where large statistical fluctuations are observed. The numbers of observed (expected) events in the "loose" and "tight" signal regions are 57 (58.5) and 4 (2.4), respectively. Interesting to note are the events observed in data at very high values of E_T^miss (> 900 GeV). These have been investigated further and found to contain very high-pT calorimeter-tagged muons which pass the loose muon selection working point. By applying tight identification criteria on the muons, these events could be rejected. The display and study of such events could make an excellent technical task for the students, giving them the opportunity to learn more about detection techniques.

Summary and outlook
ATLAS 13 TeV Open Data samples, collected in pp collisions at the LHC and corresponding to an integrated luminosity of 10 fb⁻¹, together with various Standard Model and beyond-SM physics simulation samples, have been made public for education. The samples are accompanied by data processing and analysis tools. These resources allow students to easily access and analyse data using desktops or laptops, to practise programming (Python and C++), and to adapt Jupyter notebooks, using both Python and C++ ROOT kernels, to analyse data and MC samples and produce physics-analysis results with a VM as a server. Both the data and tools have been validated and deployed in a number of university courses. The students acquire particle-physics knowledge by making SM measurements, searching for new physics and reproducing as closely as possible important results published by the ATLAS Collaboration. In addition, they gain experience with modern software tools and with the principles of cloud and distributed computing.