Columnar data analysis with ATLAS analysis formats

Future analysis of ATLAS data will involve new small-sized analysis formats to cope with the increased storage needs. The smallest of these, named DAOD_PHYSLITE, has calibrations already applied to allow fast downstream analysis and to avoid the need for further analysis-specific intermediate formats. This enables the “columnar analysis” paradigm, in which operations are applied on a per-array instead of a per-event basis. We present methods to read the data into memory using Uproot, discuss I/O aspects of columnar data and alternatives to the ROOT data format, show a representation of the event data model using the Awkward Array package, and present a proof of concept for a simple analysis application.


Introduction
The increasing amounts of data from the ATLAS [1] experiment in the coming LHC Run 3, and even more so in the future High-Luminosity LHC (HL-LHC) data-taking era [2], demand efficient processing to be able to analyze these data with the available computing resources. Finding a compromise between analyzer productivity and highly optimized data analysis code is important. Columnar data analysis allows for such a compromise by separating the low-level numerical calculations on large arrays from the analysis code (typically written in python), which describes operations that process an array at a time. In contrast, the more traditional approach in high energy physics (HEP) consists of manually writing event loops, which necessitates the use of compiled programming languages like C++ to produce efficient code. Such code is often time-consuming to develop.
With the increasing demand for processing large amounts of data in fields outside of HEP, and the rising popularity of machine learning methods, the availability of general tools, especially in the scientific python ecosystem, has increased. Additions to this ecosystem, inspired by HEP needs, have made significant progress in recent years. Many of them are collected under the umbrella of the Scikit-HEP [3] project. Currently, the Coffea framework [4], together with the Awkward Array package [5], provide the most complete set of tools for columnar data analysis in HEP.
The ATLAS analysis model [6] for the LHC Run 3 includes the new small-sized analysis formats DAOD_PHYS and DAOD_PHYSLITE. The latter has a set of calibrations and reconstructions, harmonized across common analysis use cases, already precalculated, which allows running fast analysis without the need to call complex reconstruction and calibration algorithms. DAOD_PHYSLITE is therefore an ideal input dataset for columnar data analysis, as discussed in this document.

Figure 1. Loading times for nested structures where data content and offsets (list lengths) cannot be read separately from stored columns. The times include the time needed for decompression. The decompression time without deserialization is shown for reference, as well as an example for a branch that can be read in a columnar fashion, indicated by a dashed black line. Using the improved deserialization routines, complex data structures can be read efficiently, with similar or better performance than ROOT's TTree::Draw function. All times are recorded after running a second time to load from cached pages instead of hard disk.

I/O and data formats
Python data science tools mostly deal with flat n-tuples (table-like formats). However, HEP data like DAOD_PHYSLITE contain variable-length lists of numbers per event (so-called single-jagged vectors), e.g. lists of tracks, leptons etc. The ROOT [7] data format, currently used for DAOD_PHYSLITE, allows for columnar reading of such single-jagged structures, since event offsets are stored separately from the data. Structures with multiple members can also be automatically split into separate columns, which is done for DAOD_PHYSLITE where possible. The Uproot [8] package can therefore read these quantities efficiently using NumPy [9]. More complex are cases with a higher level of nesting, where offsets are not stored separately in the ROOT file format. A typical example for this in DAOD_PHYSLITE are references into other collections, called ElementLink. For example, an electron can have multiple associated tracks, thus needing a double-jagged vector (vector<vector<ElementLink>>) to represent these indices. Furthermore, ElementLink structures that are more than single-jagged are not split into their members (an index and a hash representing the linked collection). Reading such complex data structures requires routines that iterate through the data, collecting the offsets and data. The current python code used for this in Uproot needs to be replaced by compiled routines for efficient reading. Figure 1 shows an example using a python implementation with just-in-time compilation using numba [10]. This confirms the findings shown in Ref. [5]. Notably, the loading time is always lower than reading with ROOT's TTree::Draw function. The performance for reading the DAOD_PHYSLITE format with Uproot will therefore not be limited by deserializing branches with higher-level nesting when such specialized routines are used. They are foreseen to be included in future versions of the Awkward Array package.
Reading data into a columnar representation in memory also raises the question of what the best storage format for this workflow is. With the splitting of complex objects into members and the grouping of their data into contiguous blocks (called baskets), the ROOT format is well suited for columnar data access. However, the basket size needs to be chosen appropriately. Columnar data analysis workflows benefit from blocks of contiguous data that are as large as possible. Besides better compression efficiency, this can lead to higher throughput when reading data from disks, especially in case only a few of the available columns are read, which is common in ATLAS data analysis. In environments with parallel disk access by multiple processes, the reduced number of disk seeks is a major factor for this aspect. Figure 2 shows a visualization of the data access patterns when reading only a few of the available columns, comparing the currently used ROOT settings for DAOD_PHYSLITE, with comparably small basket sizes, to the Apache Parquet format [11] with one row group (effectively storing each column in one contiguous block) and to ROOT with large baskets (one basket per column).

[Figure 2 panels: ROOT (original), Parquet, ROOT (large baskets)]

When only whole-column access is needed, storage formats other than ROOT (potentially much simpler ones) allow for significantly faster reading (see Table 1). The slower reading of the ROOT format using Uproot, when compared to other formats with similar compression settings, can be attributed to the higher complexity of the ROOT format and thus a higher (constant) overhead in addition to the pure data reading. Formats without support for nested/jagged structures (namely HDF5 [12] and zipped NumPy arrays [13]) can be used as storage via Awkward Array's to_buffers/from_buffers feature, which splits the data into contiguous arrays and reconstitutes them via a JSON [14] form. In addition to faster reading, pure columnar storage is easier to compress, which results in smaller file sizes, especially for compression algorithms with high compression ratios such as gzip. Deduplication of offsets (e.g. storing the number of electrons only once instead of for each electron property) can reduce the file size even more; compression alone cannot achieve this, since each column is compressed independently.

Columnar representation of ATLAS analysis data
The ATLAS event data model used in DAOD_PHYSLITE is converted into columnar form using Awkward Array. Many of the concepts used are similar to the NanoEvents module in Coffea, which has been developed for the NANOAOD [16] format at CMS [17].

Proof of concept
To test the columnar analysis workflow on a realistic analysis example, the representation from Section 3 is used to perform a set of object selections for electrons, muons and jets on a Monte Carlo sample of 50000 simulated tt events in DAOD_PHYSLITE format. A C++ Athena [18] algorithm using the SUSYTools package [19] is used as a reference for comparison. The selection criteria in the columnar analysis are implemented to match the C++ reference as closely as possible, but some details like track parameter selections are omitted. This object selection consists of two types of operations. First, filters on stored quantities like isolation variables, momenta and pseudorapidity values have to be applied. The convention from SUSYTools defines objects satisfying loose "baseline" and tighter "signal" criteria. Second, overlap removal for ambiguity resolution and additional isolation has to be applied. This is more complex, since it involves checking all combinations of two particles. The calculation uses the awkward.cartesian function, which constructs these combinations. Distance- or index-based matching can then be performed in a vectorized fashion. As shown in Figure 3, for most selections the number of objects passing the SUSYTools selection can be reproduced by a columnar data analysis in python.

The processing times for the columnar data analysis are also significantly smaller than the reference, confirming the efficiency of this approach. A comparison is shown in Table 2. A quick test on a different data sample showed that the processing time roughly scales with the mean number of jets. Average data samples are therefore expected to be processed slightly faster than the tested tt Monte Carlo sample.

Table 2. Processing time for 50000 DAOD_PHYSLITE events using the reference implementation with Athena/SUSYTools in C++, compared to a columnar data analysis in python. In both cases, only object selections for electrons, muons and jets are evaluated. The total time does not include a constant overhead for initializing the necessary objects and environments before loading (non-meta) data and event processing. This overhead amounts to approximately 15 s for Athena/SUSYTools and 2.4 s for the columnar data analysis. The measurement "Columnar (cached)" refers to processing with all data already decompressed, deserialized and loaded into memory. This time is significantly lower than the total time that includes loading, indicating that for the columnar data analysis most of the time is spent on input, decompression and deserialization. All times refer to the minimum of 5 consecutive runs.

[Table 2 body not recovered; columns: Measurement, Total]

While this first demonstration of an object selection for electrons, muons and jets was successful, many steps of a typical analysis remain to be done. Some of them will require further research and development. One example is the calculation of the missing transverse momentum, which is built depending on the analysis-specific object selections and therefore also has to be addressed in an analysis of the DAOD_PHYSLITE format. From experience with event-loop-based analyses, this is expected to increase the processing time by up to a factor of 2. Another important topic is the treatment of systematic uncertainty variations for later statistical analysis. To be able to calculate these in a columnar data analysis, they have to be applicable in a parameterized fashion without the need for running complex reconstruction and calibration algorithms.

Summary
Columnar data analysis offers an attractive way to analyze the new ATLAS data analysis format, DAOD_PHYSLITE. Reading data into memory, representing the event data model in columnar form and running typical analysis workflows can be done with available python tools. Such a workflow was shown to be efficient, offering significantly faster throughput compared to a traditional analysis. However, further research and development is needed to cover all aspects of a full analysis. Due to the efficient processing, such an analysis is mainly limited by input reading. Therefore, further tuning of storage settings like ROOT basket sizes could improve the throughput even more, especially when considering parallel file access from many workers. When a format is only used for columnar data analysis, storage formats other than ROOT can be an interesting, performant option. Such alternative formats could also provide substantial savings in terms of storage space, the most critical factor for HL-LHC resource requirements.