Pandas DataFrames for a FAST binned analysis at CMS

Binned data frames are a generalisation of multi-dimensional histograms, represented in a tabular format with one bin per row containing the labels, bin contents, uncertainties, and so on. Pandas is an industry-standard tool which provides a data frame implementation complete with routines for data frame manipulation, persistency, and visualisation, and easy access to “big data” scientific libraries and machine-learning tools. FAST (the Faster Analysis Software Taskforce) has developed a generic approach for typical binned HEP analyses, driving the summary of ROOT Trees into multiple binned DataFrames with a YAML-based analysis description. Using Continuous Integration to run subsets of the analysis, we can monitor and test changes to the analysis itself and deploy documentation automatically. This report describes this approach using examples from a public CMS tutorial and details the benefits over traditional methods.


Introduction
The Faster Analysis Software Taskforce (FAST) was set up in May 2017. As a group of particle physicists, our initial goal was to investigate ways to address the growing data-processing requirements of current and future HEP projects (e.g. at the HL-LHC) against a backdrop of decelerating growth in processing power, similar in spirit to the HEP Software Foundation, whose community white paper was published around the same time [1].
FAST's objectives include: establishing a set of best practices for analysis tooling; sharing these as feedback to developers and helping to educate peers and other users; and contributing to existing tools or, where necessary, developing our own to close gaps in the HEP analysis software ecosystem. To meet these goals, FAST has held regular hacking workshops to experiment with new ideas and techniques.
This paper describes one of our primary focuses so far, namely how an analysis whose result is produced by comparison of binned distributions can adapt to using the Pandas library's data frame implementation for internal persistency and manipulation. Such analyses are common within CMS and other experiments [2]. In this paper, analysis refers to the final stages of data processing, after event reconstruction.
This paper documents the prototype approach we have developed. Updates and developments will be announced via our homepage: http://fast-hep.web.cern.ch/fast-hep/public/. Since CHEP2018, many of the packages that grew out of this prototype have been documented and added to PyPI; they can be found via the homepage. The FAST approach is more streamlined than traditional pipelines and exploits the versatility, performance, and functionality provided by the Pandas data-analysis package. Input files in most HEP analyses will have a ROOT-based format [3]. It is commonplace to pre-process these input files to reduce their size, either by removing events that will definitely fail a later selection (a "skim") or by removing variables that are not of use (a "slim"). Since the processing time in a typical analysis chain can be slow, this pre-processing is helpful when repeated iterations are needed as the analysis is refined to produce the final result. This is the first change in the FAST approach: remove this step completely by making sure the subsequent steps run quickly. As a result there is little-to-no need to pre-process data, making it easier to rerun as new data becomes available.

The FAST approach
The next overall change is to simplify the internals of the analysis framework by using a single data format, the Pandas DataFrame [4], to persist data. An early step in many analyses is to produce a large number of ROOT histograms, with additional information stored in the histogram names in some structured way. Further information might be persisted in other formats: distributions for event weighting (e.g. CSV files or Python pickles), lists of run and event numbers to inspect (e.g. JSON), or prescriptions for combining the distributions to fit the final result (e.g. RooStats workspaces). Pandas DataFrames are able to cover all of these use cases, making it easier to manage the different data sources coherently.
Simplifying the analysis chain and homogenising the internal data formats has two additional advantages. Firstly, it becomes easier to adapt the analysis, since there are fewer steps to consider and because the data frame approach allows the binning dimensions to be changed. Secondly, it is much easier for a newcomer to learn how to run the code, given that each internal step is more similar to the others, and because the data format can be inspected and interpreted with less knowledge of how the analysis code itself will look at it.
In order to produce the binned Pandas DataFrames, FAST has used AlphaTwirl [5], which had already been developed by Tai Sakuma, a member of FAST. AlphaTwirl is a Python-based tool which ingests event-level data and produces binned columnar data.

Pandas dataframes as multi-dimensional histograms
Pandas [4] is a common data-analysis toolkit written for Python, with C++ optimisations behind the scenes. The primary data format within Pandas is the DataFrame, a programmatic interface to manipulate tabular data. In industry applications, a DataFrame is often used to represent time-series data with a fixed number of columns, similar to a ROOT TTree whose events contain only scalars and fixed-length lists. However, Pandas provides advanced labelling for its rows and columns, such as multi-indexing, i.e. nested labelling. A DataFrame can therefore be treated very naturally as a multi-dimensional histogram, where one row is one bin. The methods to interact with bin labelling are highly generalised, giving us the flexibility to manipulate the binned data in a consistent way regardless of the number of dimensions. Listing 1 shows an example DataFrame and how it looks once converted to a multi-index format similar to how our binned data looks. Using Pandas in this way gives greater consistency, fewer lines of code, and direct interfaces to other industry-standard tools, such as numpy, and machine-learning packages, such as sklearn, TensorFlow, etc.

Listing 1: A DataFrame made from random data. By default, Pandas indexes rows by row number; with the set_index method, however, we create a DataFrame with two data columns, indexed by two variables, which allows a DataFrame to act as a multi-dimensional histogram.
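A minimal sketch of the pattern Listing 1 describes might look as follows; the column names here are illustrative, not those of the tutorial:

```python
import numpy as np
import pandas as pd

# A flat table of bin labels and bin contents, built from random data.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "dataset": ["data", "data", "mc", "mc"],
    "dimu_mass_bin": [60, 90, 60, 90],
    "n": rng.integers(0, 100, size=4),
})

# By default rows are indexed 0..3; promoting the label columns to a
# MultiIndex makes each remaining row one bin of a 2-D histogram.
binned = df.set_index(["dataset", "dimu_mass_bin"])
print(binned.loc[("data", 90), "n"])  # content of a single bin
```

The same `loc`-based lookup works for any number of index levels, which is what makes the representation generalise across binning dimensions.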

Summarising ROOT trees
The first stage in the analysis pipeline is to produce the binned data frames from ROOT TTrees. To demonstrate how the FAST approach tackles this, we use the CMS HEP tutorial [6], which provides real and simulated sample data sets and a set of ROOT-based C++ analysis scripts for comparison.

The config file
To run this step, FAST uses the AlphaTwirl package, but wraps the interface with a configuration file, based on YAML [7]. This configuration file reduces the amount of code and isolates analysis-specific decisions and details. This, in turn, allows the code's performance to be improved without changing the analysis itself.
An example of the configuration file is shown in Listing 2. The first section, under the key stages, defines how data will be processed. Three types of stage are available at this point: CutFlows (to select events), BinnedDataFrames (to produce summary histograms), and Scribblers (to insert new variables into the event). Once the processing chain is defined, each stage is given a complete description in the other top-level sections, whose names correspond to the stage names.

Event selection
Lines 21 to 26 of Listing 2 show how an event selection can be configured. In this example, events are selected if they contain more than two isolated muons, the flag triggerIsoMu24 is enabled, and the transverse momentum of the leading muon (muons are ordered by transverse momentum in this data) is greater than 25 GeV. These three cuts are combined with a boolean AND operation, indicated by the All key: all cuts specified in that list must pass for the event to pass. Alternatively, All could be replaced with Any, which would result in the cuts being combined with a boolean OR. Cuts can be nested by putting an Any or All dictionary as the value of any cut. Table 1 shows the output of this stage: the number of events passing each cut for each input data set. While only unweighted counts are shown, this can easily be reconfigured to produce weighted counts. From this output, it is only a few steps to produce publication-quality reports of the cut-flow efficiencies, a common requirement in allowing results to be easily re-interpreted. In fact, the FAST approach and code have already been used to do exactly this [2].

Listing 2: Example configuration to summarise the input ROOT trees. First we add several new variables, then we select events, and lastly we build a binned data frame.
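The All/Any semantics and the cut-flow counting can be sketched with plain pandas boolean masks; the event table and column names below are hypothetical stand-ins for the tutorial's branches:

```python
import pandas as pd

# Hypothetical events: number of isolated muons, a trigger flag,
# and the leading muon's transverse momentum in GeV.
events = pd.DataFrame({
    "NIsoMu": [3, 1, 4, 2],
    "triggerIsoMu24": [True, True, False, True],
    "muon_pt_leading": [30.0, 40.0, 28.0, 20.0],
})

# Each cut is a boolean mask over the events.
cuts = {
    "NIsoMu > 2": events["NIsoMu"] > 2,
    "triggerIsoMu24": events["triggerIsoMu24"],
    "leading muon pt > 25": events["muon_pt_leading"] > 25,
}

# "All" combines the cuts with AND, accumulating a cut-flow count
# after each cut; an "Any" block would use |= instead of &=.
passing = pd.Series(True, index=events.index)
cutflow = {}
for name, mask in cuts.items():
    passing &= mask
    cutflow[name] = int(passing.sum())
print(cutflow)
```

The resulting dictionary is the unweighted cut-flow of Table 1 in miniature; a weighted count would sum a weight column over `passing` instead of counting rows.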

Event distributions
The dimuon_mass stage of the configuration in Listing 2 will produce a binned data frame containing the distribution of the dimuon mass. Only events passing the preceding event selection will be included in this stage's output. Table 2 shows the outputs of this stage. See section 4 for how this output can be easily turned into plots.
It can be seen how easy it is to configure such stages from the configuration file and how extra binning dimensions can be added in a generalised manner. Each binned data frame requires a list of dimensions to define how it should be binned. If a dimension is categorical, i.e. already discrete, no binning scheme is necessary; we only specify which variable to use (the in field) and, optionally, what to call the corresponding column in the output (out). If the variable is continuous, however, a binning scheme must be provided, either as a range with the number of bins to divide it into or as a list of bin edges. Finally, a list of variables to use as event weights can also be given; for each of these, a separate DataFrame will be produced.
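What such a stage produces can be sketched directly in pandas: a categorical dimension needs no binning, a continuous one is discretised against a list of bin edges, and both raw and weighted counts are accumulated. All names below are illustrative assumptions, not the tool's actual implementation:

```python
import pandas as pd

# Illustrative events: one categorical dimension (the data set name),
# one continuous variable to bin, and a per-event weight.
events = pd.DataFrame({
    "dataset": ["data", "data", "mc", "mc", "mc"],
    "dimu_mass": [88.0, 91.2, 90.5, 60.0, 120.0],
    "weight": [1.0, 1.0, 0.8, 0.8, 0.8],
})

# A binning scheme given as a list of bin edges, as it could appear
# in the YAML config for a continuous dimension.
edges = [50, 80, 100, 130]
events["mass_bin"] = pd.cut(events["dimu_mass"], bins=edges)

# One output row per (dataset, bin): unweighted and weighted counts.
binned = events.groupby(["dataset", "mass_bin"], observed=True).agg(
    n=("dimu_mass", "size"),
    sumw=("weight", "sum"),
)
print(binned)
```

A range-plus-number-of-bins scheme would simply generate `edges` with `numpy.linspace` before the same `pd.cut` call.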

New variables and custom processing steps
Custom-made stages can also be included, such as ones which calculate new variables, as shown in lines 7 to 13 of Listing 2. This allows users to write analysis-specific stages in Python while providing key parameters from the config file.

Manipulating dataframes
The Pandas library allows the binned data frames from the various stages to be manipulated easily. For example, line 2 of Listing 3 shows how easy it is to add a new column to the data frame loaded in for Table 2. It then shows a more elaborate set of operations: the original "long-form" data frame, where the data set is indicated in the component column, is converted to a "wide-form" frame where each data set has a separate set of count and error columns. The label ordering is then changed to match that of the original tutorial.

Listing 3: Left: code (continuing from the data frame of Table 2) to convert the data frame to wide form and re-order the columns; Right: the resulting data frame.
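The long-to-wide conversion and re-ordering can be sketched as follows; the frame and its labels are illustrative, not the actual Table 2 contents:

```python
import pandas as pd

# A small binned frame in "long form": the data set is one index
# level (called "component", as in the text above).
long = pd.DataFrame({
    "component": ["data", "data", "mc", "mc"],
    "dimu_mass_bin": [60, 90, 60, 90],
    "n": [5, 20, 4, 18],
    "err": [2.2, 4.5, 0.4, 1.3],
}).set_index(["component", "dimu_mass_bin"])

# Wide form: one set of (n, err) columns per data set.
wide = long.unstack("component")

# Re-order the component labels, e.g. to match a preferred legend order.
wide = wide.reindex(columns=["mc", "data"], level="component")
print(wide)
```

Because the data-set labels live in an index level rather than in histogram names, this reshaping is a single `unstack` call regardless of how many data sets or binning dimensions are present.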

Making plots
Pandas has built-in support for plotting, which can convert the binned data frames into figures that would be familiar to most particle physicists. Listing 4 continues from Listing 3 and demonstrates the code used to reproduce the dimuon mass distribution; the resulting plot is shown in Fig. 2. Compared to the version produced using the CMS HEP example's ROOT-based code, the Pandas DataFrame approach shows the same contents, although the overflow bin is shown by default, unlike in ROOT. Whilst the aesthetics of the plot are different, one can reproduce the ROOT style with only a few additional lines, if desired.

Running a fit from binned dataframes
It is very common in the final stages of such an analysis to fit predictions to the observed data in order to extract the parameters of interest. FAST has also worked on how Pandas DataFrames can be plugged into existing fitting tools, particularly in the case of the CMS experiment. Using another YAML-based config file containing details of the fit parameters and systematics, the data frames from the steps described above are converted into the inputs needed to perform such a fit.

Figure 2: Left: the plot produced using the HEP tutorial's ROOT-based code; Right: the equivalent plot using the FAST approach.

Continuous integration for analysis
Continuous Integration (CI) has become a mainstay of any modern development environment. In addition to using modern tools within the analysis, FAST checks the analysis itself using CI. The pipeline currently used to test FAST code runs: static code checks for style compliance, unit tests, integration tests, and automated documentation generation when the master branch is updated. The static code checks ensure that the code remains at a high quality (i.e. PEP8 compliance, checked using Flake8 [8]). Unit tests (written with Pytest [9]) run individual functions and classes through specific checks to pinpoint when an interface has unexpectedly changed and to catch common bugs early. Integration tests, on the other hand, run the whole analysis chain on a small testing sample of the data and compare the outputs of each stage against what is expected. The tools to run these steps were written by FAST and give a general overview of when an analysis has changed and by how much. Future versions will also check the analysis' computing performance, such as time per event, memory requirements, etc. Finally, the last stage in the CI pipeline generates documentation (using Sphinx [10]), such as examples and a cross-linked API reference, which can be deployed automatically.
Overall, using CI significantly reduces the effort to maintain a consistent, functioning, and documented analysis chain.

Comparisons to traditional approaches
There are many differences between the analysis approach described here and a more traditional one. Using the CMS HEP tutorial as a benchmark, the FAST approach is able to describe the analysis using 26 lines of YAML config, plus 55 lines in a Python file to create the necessary variables not contained in the input tree. An additional config file is used to describe the input files, which takes up 10 lines of text, although this was generated by a command-line tool provided within the FAST code-base. Turning each of the resulting data frames into a plot adds a further 15 lines of Python code in notebooks. As such, the full analysis, from trees to plots, is described with around 106 lines of text, split between config files for the bulk of the analysis decisions and some Python code to help produce new variables and plot the distributions. Future versions of the FAST code-base will improve the ability to plot binned data frames by adding general helper functions, further simplifying the code needed for each analysis. By comparison, the equivalent code in the CMS public analysis (found in the tar-ball in [6]), implemented in C++ and using ROOT, contains more than 600 lines of code. Whilst it is hard to make an "apples to apples" comparison between a Python-based tool and an analysis code written in C++, this nevertheless demonstrates how the FAST approach is able to compress analysis decisions into fewer lines of code whilst retaining the expressiveness needed to be generic.
Another key metric for comparison is the speed of execution. Here the C++-based analysis code performs better than the current implementation of the FAST approach, taking roughly 4 seconds to execute compared to the FAST code's 60 seconds. Whilst this is a big difference in performance, the fact that the analysis is controlled through the configuration file allows the code to be optimised behind the scenes. AlphaTwirl uses a Python-based event loop, and preliminary studies suggest that adapting this to a chunked and vectorised alternative using uproot [11] can achieve around a factor of 30 improvement in run time, i.e. equivalent to or better than the C++ example.
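The idea behind that optimisation can be illustrated with plain numpy: replace a per-event Python loop with histogramming over large chunks of events, as uproot delivers them. This is only a sketch of the vectorisation principle, not AlphaTwirl's or uproot's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(91.0, 10.0, size=10_000)  # e.g. dimuon masses
edges = np.linspace(50.0, 130.0, 41)          # 40 equal-width bins

# Per-event Python loop: one interpreter round-trip per event.
counts_loop = np.zeros(len(edges) - 1, dtype=int)
for v in values:
    i = np.searchsorted(edges, v, side="right") - 1
    if 0 <= i < len(counts_loop):
        counts_loop[i] += 1

# Chunked, vectorised alternative: histogram each block in one call,
# so the per-event work happens inside compiled numpy code.
counts_vec = np.zeros(len(edges) - 1, dtype=int)
for chunk in np.array_split(values, 10):
    h, _ = np.histogram(chunk, bins=edges)
    counts_vec += h
```

Both approaches produce identical bin contents; the difference is purely in how much work stays inside compiled code per event, which is the source of the quoted speed-up.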

Conclusion
The FAST approach has the goals of building a more modular, flexible, and concise way to run a binned particle physics analysis. The generic configuration files that describe each step make it easier to reproduce and share a given analysis, as well as providing a separation between code and analysis details. In addition, they allow the underlying code to be optimised behind the scenes. Using Pandas DataFrames as the sole means to transfer data between the various analysis stages reduces the complexity and helps adapt the data for use in other industry-standard tools such as Jupyter notebooks.
Although what has been presented here is already in use on the CMS experiment, FAST intends to bring this approach to maturity by modularising the code-base further to provide a series of PyPI-served packages. In addition, the overall performance is being improved, in particular by making use of packages from the scikit-HEP project, including uproot and awkward-array.