The Scikit-HEP project -- overview and prospects

Scikit-HEP is a community-driven and community-oriented project with the goal of providing an ecosystem for particle physics data analysis in Python. Scikit-HEP is a toolset of approximately twenty packages and a few "affiliated" packages. It expands the typical Python data analysis tools for particle physicists. Each package focuses on a particular topic and interacts with other packages in the toolset where appropriate. Most of the packages are easy to install in many environments; much work has been done this year to provide binary "wheels" on PyPI and conda-forge packages. The Scikit-HEP project has been gaining interest and momentum, building a user and developer community and fostering collaboration across experiments. Some of the packages are being used by other communities, including the astroparticle physics community. An overview of the overall project and toolset will be presented, as well as a vision for development and sustainability.


Introduction
Python is an increasingly popular programming language across a broad range of communities, notably in Data Science. Outside High Energy Physics (HEP), the Python scientific ecosystem is built atop the "building blocks" of the SciPy ecosystem of open-source software for mathematics, science, and engineering [1]. Figure 1 provides a good visual illustration of the ecosystem, which grows from foundational libraries all the way to domain-specific projects such as Astropy [2]. The ecosystem provides tools for data manipulation, visualisation, statistics, machine learning, etc. Traditionally, HEP has been evolving in a rather disjoint ecosystem based on the C++ ROOT data analysis framework [4]. Like the Python scientific ecosystem, ROOT provides tools for data manipulation and modeling, for fitting, for statistics and for machine learning applications. But it is a toolkit rather than a toolset, and the Python interface it provides via bindings is not very natural for Python users.
Various initiatives have tried to link the HEP and non-HEP worlds, at least for specific tasks. Unfortunately, these libraries were largely developed by single authors, and sustainability quickly became an issue, especially as many of those authors left the field. Community adoption also did not stick. We believed that a more generalised, domain-specific effort was the way forward, and this gave rise to the Scikit-HEP project in late 2016. It is only in 2018 that community adoption started to take off, with several project packages attracting much attention. We are proud to mention that several collider (Belle II, CMS) and non-collider (KM3NeT) experiments officially use some of the Scikit-HEP packages among their external dependencies, as do other software projects (Coffea, zfit).
The project was presented at the CHEP 2018 conference [5]. The present report supersedes Ref. [5] and presents the status of the project, which has evolved considerably since CHEP 2018.

Scikit-HEP project overview
The Scikit-HEP project [6] is a community-driven and community-oriented effort with the aim of providing Particle Physics at large with a toolset ecosystem for data analysis in Python.
It does not attempt in any way to provide a replacement for the Python ecosystem based on the SciPy suite; rather, it builds on its foundational libraries, providing core and common tools for the HEP community. The grand plan of the project can be summarised in the following points:
• Create an ecosystem for particle physics data analysis in Python.
• Improve the interoperability between HEP tools and the scientific ecosystem in Python.
• Expand the typical toolset for particle physicists with high-standard, well-documented and easily installable domain-specific packages.
• Build a community of developers and users, having sustainability in mind.
• Improve discoverability of (domain specific) relevant tools.
The Scikit-HEP toolset is depicted (to a large extent) in figure 2. Some of the packages found in the GitHub organisation, such as the well-known packages root_numpy [7] and root_pandas [8], which pre-date the project, are not described in this report. They are nevertheless part of the project, but have been largely superseded by the new and more versatile packages uproot [9] and awkward-array [10], see below. More importantly, it should be emphasised that most of the packages presently constituting the Scikit-HEP toolset are relatively new, having been released for the first time after the CHEP 2018 conference; these are marked as "new package" in figure 2.
The remainder of this report provides a whirlwind tour of the main packages.

Figure 2: Overview of (most of) the packages making up the Scikit-HEP toolset. For a sense of evolution, all packages whose first release came out after the CHEP 2018 conference are marked as "new package". All GitHub repositories can be found at https://github.com/scikit-hep.

Whirlwind tour of Scikit-HEP packages
The obvious "point of entry" to the HEP ecosystem is via ROOT files. These can be natively and trivially read with the pure-Python I/O package uproot [9], whose only dependencies are NumPy and Python libraries to deal with compression and decompression. The package is straightforwardly installable via pip or conda on virtually any computer, since it does not depend on ROOT. It has been a runaway success, with over 15000 downloads per month.
The awkward-array package [10] provides a way to analyse such variable-length, tree-like data in Python by extending NumPy's idioms from flat arrays to arrays of data structures. The package is being reimplemented in C++, with a simpler interface and fewer limitations, based on acquired experience and user feedback; the developments are taking place at https://github.com/scikit-hep/awkward-1.0.

The analysis of datasets (processed e.g. with the two packages just described) typically involves data aggregations, most often in the form of one-dimensional histograms. Indeed, histogramming is central to any analysis workflow and has received much attention. The boost-histogram package [11] bundles the Python bindings for the performant C++14 multi-dimensional templated header-only library Boost.Histogram [12], albeit with a Pythonic API. Histograms can be defined in a very versatile way owing to the extensive types of axes (regular, variable, circular, etc.) and storages (integer, double, weighted values) available, in multiple dimensions. The package provides methods for selecting, rebinning, and projecting into lower-dimensional spaces. It is a high-performance histogramming package that talks naturally to the NumPy ecosystem. Its interface is simple and user-friendly; an example code snippet and its output are shown in figure 3.

Continuing a common workflow in a HEP analysis, it is important to mention the fitting package iminuit [13], the Python interface (bindings) to the Minuit2 C++ package [14] used across particle physics. Minuit2 is most commonly used for likelihood fits of models to data, and for estimating model parameter errors from likelihood profile analysis. The bindings constitute an important building block for sophisticated data-modelling packages; iminuit is used in other HEP packages and in various astroparticle physics packages.
The Scikit-HEP project contains other utility packages such as:
• The Particle package [15]: a Pythonic interface to the Particle Data Group (PDG) particle data table and Monte Carlo particle identification codes, with a multitude of goodies such as powerful and flexible searches as one-liners.
• The DecayLanguage package [16]: tools to parse decay files and programmatically manipulate them, query and display information; classes for a universal representation of particle decay chains.

Outlook
The Scikit-HEP project has been gaining much interest and momentum in the last couple of years. Together with other projects, it is providing a modern, alternative ecosystem for HEP analysis in Python. The project is community-driven and community-oriented. It is building a user and developer community, fostering collaboration across experiments. This is crucial to ensure continuity and sustainability, with a culture where the users of today are meant to become the developers of tomorrow. Some of the project packages are being used by other communities, including the astroparticle physics community.