hep_tables: Heterogeneous Array Programming for HEP

Array operations are one of the most concise ways of expressing the common filtering and simple aggregation operations that are the hallmark of the first step of a particle physics analysis: selection, filtering, basic vector operations, and filling histograms. The High Luminosity run of the Large Hadron Collider (HL-LHC), scheduled to start in 2026, will require physicists to regularly skim datasets that are over a PB in size, and to repeatedly run over datasets of hundreds of TB - too big to fit in memory. Declarative programming techniques separate the intent of the physicist from the mechanics of finding the data, processing the data, and using distributed computing to process it efficiently - all of which is required to extract the desired plot or data in a timely fashion. This paper describes a prototype library that provides a framework for different sub-systems to cooperate in producing this data, using an array-programming declarative interface. The prototype has a ServiceX data-delivery sub-system and an awkward array sub-system cooperating to generate the requested data. The ServiceX system runs against ATLAS xAOD data.


Introduction
A particle physicist uses a number of heterogeneous systems to make a plot [1]. A typical workflow might start with reconstructed data - data that contains physics objects - located in an experiment's production system. The output of the experiment's production system is often located on the GRID, a loosely federated collection of CPU and disk farms. The physicist must submit a job to run on the GRID to access that data. This job extracts the data, applies corrections, and writes out the data in a simplified format, most likely ROOT TTree's. The physicist then downloads the data locally, and uses local tools to run over that data to extract plots. Time scales can be long: while the jobs on the GRID will normally run quickly, GRID site down times, data delivery glitches, etc., often mean there is a long tail when the user runs on 100's of files [grid-performance]. For many analyses this phase takes more than a week. Once the data is stored locally, it can take approximately a day to produce a plot. However, if the physicist decides that they need a new quantity from the original production data, the cycle must be restarted. For this reason, physicists tend to extract a large amount of data, making the local data sets 10's to 100's of TB.
A second aspect makes this harder than, perhaps, it needs to be: the physicist must know multiple programming languages and tools - C++ and python for the GRID jobs, along with command line tools and scripts to submit and babysit the jobs. For making the plots one has to use either C++ and ROOT, or Python and the Scikit-HEP set of libraries. Further, one has to have more than a passing familiarity with the complexity of distributed computing systems, along with how well the weakest GRID site is working so that it can be avoided.
The python ecosystem has developed a compelling tool-set that uses array-processing semantics to quickly and intuitively analyze data that can be loaded as in-memory arrays. In particle physics, the awkward_array library [2] is the most popular of these; it borrows and extends semantics from the famous rectilinear numpy [3] array processing library to work with non-uniform multi-dimensional arrays (e.g. JaggedArray's). This project explores using an array-like interface to take over not only the last step in the plot production process, but also the part that uses the GRID. As an example, the code in Listing 1 makes a plot of electron pT starting from a set of GRID files that are the output of the ATLAS Monte Carlo production:

1 from func_adl import EventDataset
2 from hep_tables import xaod_table, make_local

Listing 1: A complete hep_tables code snippet that extracts the electron pT from an ATLAS xAOD Monte Carlo sample, brings it to the local machine, and uses matplotlib to plot it as a histogram.
Line 4 declares the GRID dataset to be used, and lines 7-8 define a good electron as being central and having pT > 50000 MeV (the units in an ATLAS xAOD file). Line 10 triggers the rendering of the data, which involves calling out to the ServiceX [4] backend to fetch the data. It is returned as an awkward array, which is plotted in lines 12-14.

Library Structure
This library provides a user-interface that looks much like numpy or awkward_array. Instead of immediately executing the operations, however, it records them. When the physicist requests the actual plot or other result, the library resolves the query into a set of actions via plug-ins. The end result is a number or a column of data (rendering of plots is planned).
The library is built in two layers. The first layer is responsible for recording the actions in the code. The second layer is responsible for executing them. The most popular libraries, like numpy and awkward_array, do both these operations at the same time: an expression like x[0]+y[0] is executed as it is evaluated. Two advantages are gained by splitting this operation into recording and execution. First, the operations can be fused - the addition and the array slicing can take place at the same time, avoiding the creation of temporaries. This is done by analyzing the recording and looking for operations that can be fused. Second, the recording can easily and efficiently be shipped to a remote system where it is executed and only the results transmitted back. Or the execution could be split across multiple systems, without the physicist having to track what is going on.
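The benefit of fusing can be illustrated with a small sketch. This is not the hep_tables implementation - plain python lists stand in for arrays - but it shows how an eager evaluation creates temporaries that a fused plan avoids:

```python
def eager(x, y):
    s = x[:100]                     # temporary list created here
    t = y[:100]                     # and another one here
    return [a + b for a, b in zip(s, t)]

def fused(x, y):
    # Slice and add fused into a single pass: no temporaries.
    return [x[i] + y[i] for i in range(min(100, len(x), len(y)))]

x = list(range(1000))
y = list(range(1000, 2000))
print(fused(x, y) == eager(x, y))   # True
```

A recording layer can discover such fusion opportunities only because execution is deferred until the whole expression is known.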

The Interface
There are standards for array programming in the python eco-system - two in particular. For rectilinear arrays, numpy is the standard, and for jagged arrays, awkward_array is the standard. A huge amount of effort has gone into both interfaces and it would be counterproductive to alter them. Nonetheless, there are certain types of queries that are difficult to encode, for example, associated-object relationships.

numpy
numpy defines the original array programming interface. Python has standardized some of the low-level interface work in a PEP [5] (python language extension). numpy defines basic operators, slicing, and accumulation functions, and is designed specifically for rectilinear data.
numpy operations, of course, are expected to be performed on numpy arrays. In the case of a project like hep_tables, one really wants to record the operations for execution later. numpy provides an interface, __array_function__, which is designed for this [6]. As a result, a call to numpy.abs(a) is turned into a call to a.__array_function__. The object a can then record the fact that numpy.abs was called on it.
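A minimal sketch of this recording trick follows. The Recorder class and its op list are illustrative stand-ins, not the dataframe_expressions internals; note that element-wise functions such as numpy.abs are ufuncs and dispatch through the companion __array_ufunc__ protocol:

```python
import numpy as np

class Recorder:
    """Records numpy calls instead of executing them (illustrative only)."""
    def __init__(self, ops=()):
        self.ops = list(ops)

    def __array_function__(self, func, types, args, kwargs):
        # Functions like np.sum or np.concatenate dispatch here (NEP 18).
        return Recorder(self.ops + [func.__name__])

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Element-wise ufuncs like np.abs dispatch here (NEP 13).
        return Recorder(self.ops + [ufunc.__name__])

a = Recorder()
b = np.abs(a)    # recorded, not executed
c = np.sum(b)    # also recorded
print(c.ops)     # ['absolute', 'sum']
```

The recorded list of operation names is the raw material a deferred-execution library can later analyze, fuse, and ship to a backend.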
awkward_array
HEP data is not rectilinear, of course, and awkward_array was written to extend numpy to support more complex arrays, called JaggedArray's. Besides defining slicing across non-uniform array runs, it also defines operations that work across array axes, for example, counting the number of jets in an event to yield a number-of-jets-per-event column. awkward_array also provides other conveniences like behaviors, which allow an end user to imbue an array with additional semantics (a 4-vector can be given Lorentz-like behavior).

Extending the Interface
While almost any operation desired is possible with these interfaces, there are places where it is quite difficult to encode them - or where they are not at all straightforward. Consider the following code snippet that matches a Monte Carlo particle to each electron using η − φ matching:

result = []
for e in electrons:
    min_value = 999.0
    particle = None
    for t in mc_particles:
        if dr(e, t) < min_value:
            min_value = dr(e, t)
            particle = t
    result.append(particle)

A particle physicist will look at this and know almost immediately what is being done: find the closest MC particle to the electron. Array programming, however, is not nearly as straightforward, as it operates on complete arrays at once.

Dataframe Expressions
There is no python library that will tape-record a set of array operations. There are many libraries that implement numpy semantics and act as drop-in replacements [7][8][9]; however, their re-implementation of the numpy semantics is tightly tied to their own execution engines.
dataframe_expressions is built to be a standalone library that simply records the operations made to a DataFrame. The library supplies an abstract DataFrame type, and then records all operations done to it. It makes as few assumptions about the underlying data model as possible. Operations occur on the DataFrame type, and include:
• The basic unary and binary math operations (e.g. −a, a + b, a/b, etc.)
• A number of python built-in functions (e.g. abs(a))
• Array leaf references and functions (e.g. a.jets and a.jets.count())
• numpy type array references (e.g. numpy.sin(a))
The recording is done with a combination of a tree structure and a python abstract syntax tree (AST). The python AST is rich - the complete python language can be expressed in it (by design). Each dataframe_expressions DataFrame object holds a link to a python AST. If a and b are DataFrame objects, then a+b becomes a new DataFrame object, which contains an AST: the binary + operator, with references to the a and b DataFrame's. All expressions are encoded in this way, building a tree of operations.
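The recording scheme can be sketched in a few lines. This is a hypothetical illustration, not the actual dataframe_expressions source: each DataFrame carries a python AST describing how it was built, and operators simply wrap the operands' ASTs in a new node:

```python
import ast

class DataFrame:
    """Sketch of an expression-recording dataframe (illustrative only)."""
    def __init__(self, expr=None):
        self.expr = expr if expr is not None else ast.Name(id='source')

    def __add__(self, other):
        # a + b records a BinOp node pointing at both operands' ASTs.
        return DataFrame(ast.BinOp(left=self.expr, op=ast.Add(), right=other.expr))

    def __getattr__(self, name):
        # a.jets records an attribute access; nothing is fetched yet.
        return DataFrame(ast.Attribute(value=self.expr, attr=name))

a = DataFrame()
b = DataFrame()
expr = (a + b).jets    # Attribute(BinOp(Name, Add, Name), 'jets')
```

Because the full python AST is used, any expression the language can write can be captured and handed, unevaluated, to a backend.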
To use the dataframe_expressions DataFrame, a library creates the base DataFrame. Usually this means defining the source data (a file, etc.). The user then manipulates the DataFrame's, finishing with some sort of call to render the results of the expression, like the make_local call in the first listing above. At that point, the library takes the DataFrame that represents the final result and can unwind the user's intent, executing it as required.
One advantage of separating the recording of user intent from the rendering of the expression is that the dataframe_expressions DataFrame can provide some shortcuts. These shortcuts then do not need to be directly handled by the back-end library. Filter functions are one example:

good_jets = a[good_jet]
Functions like this help when one needs to define a good_jet in multiple places. Behind the scenes, dataframe_expressions just calls the lambda with a DataFrame that represents the jet, and records the operations performed on it.
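The filter shortcut behaves, per event, like applying an ordinary predicate to each element. A pure-python sketch with plain nested lists standing in for the jagged array (the 50 GeV cut is an arbitrary example value, not one from the paper's listings):

```python
# Jet pt (GeV) per event; three events with 2, 1, and 3 jets.
events = [[25.0, 60.0], [40.0], [80.0, 55.0, 10.0]]

good_jet = lambda pt: pt > 50.0   # the predicate a[good_jet] would record
good_jets = [[pt for pt in jets if good_jet(pt)] for jets in events]
print(good_jets)   # [[60.0], [], [80.0, 55.0]]
```

In dataframe_expressions the predicate is never run eagerly like this; it is recorded once against a stand-in jet DataFrame and replayed by the backend.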
Python lambda functions are also supported. For example, a.jets.pt/1000.0 and a.jets.map(lambda j: j.pt/1000.0) resolve to identical array expressions (given that the interpreting library applies map as one would expect here). Python lambda functions can be used in most places in the code, but their power comes from lambda capture - where a variable declared in an outer lambda is captured by an inner lambda. For example, calculating the ∆R between each jet and all electrons: a.jets.map(lambda j: a.electrons.map(lambda e: j.DeltaR(e))). This creates a 2D JaggedArray for each event: the first axis is the jet, and the second axis is the ∆R between that row's jet and each electron in the event. It is now possible to find the electron closest to each jet using sorting, arg-sorting, or indexing.
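The nested-lambda ∆R calculation above is equivalent, per event, to a double loop. A minimal pure-python sketch follows; the delta_r helper and the kinematic values are invented for illustration:

```python
import math

def delta_r(j, e):
    # Illustrative eta-phi distance; atan2 handles the phi wrap-around.
    dphi = math.atan2(math.sin(j["phi"] - e["phi"]), math.cos(j["phi"] - e["phi"]))
    return math.hypot(j["eta"] - e["eta"], dphi)

event = {
    "jets":      [{"eta": 0.1, "phi": 0.2}, {"eta": -1.0, "phi": 2.0}],
    "electrons": [{"eta": 0.0, "phi": 0.3}, {"eta": 1.5, "phi": -2.5}],
}

# a.jets.map(lambda j: a.electrons.map(lambda e: j.DeltaR(e))) builds,
# per event, one row per jet with one entry per electron:
dr = [[delta_r(j, e) for e in event["electrons"]] for j in event["jets"]]

# The closest electron to each jet is the argmin along the inner axis.
closest = [min(range(len(row)), key=row.__getitem__) for row in dr]
```

Here dr[i][k] is the distance between jet i and electron k, which is exactly the 2D JaggedArray structure the captured-lambda expression produces.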
It is also possible to add new expressions anywhere in the data model. If an analysis group wants to define good_jet's for everyone, the definition can be placed in a common header file that every analysis imports. This provides some level of composability - one can easily chain complex expressions, and still conveniently define them inside a function without having to return lots of different values. For example, one could define a new property of a jet, which is the closest Monte Carlo parton, using η − φ matching.
One particular place this proves useful is defining a macro at one level but using it at another: a ptgev alias, for example, can be defined before a filter yet used after it. It is intuitive that this should work, of course, but technically it turned out to be non-trivial.
Leaves can also be referred to as strings - which helps a great deal in simplifying automated histogram and plot generation: a.jets['pt'] does exactly what one would expect it to do. The backend library does not see any of these shortcuts - it just gets the fully resolved AST to process.

HEP Tables
While dataframe_expressions provides the user interface for the package, hep_tables is the implementation. Its job is to take a DataFrame and render it with the data, producing a histogram or other final product.
All dataframe_expressions start with an initial DataFrame object. hep_tables defines a single root DataFrame which currently refers to an ATLAS xAOD dataset (xaod_table). These datasets are registered on the GRID; despite being stored around the world, the dataset names can be thought of as a unified namespace. Lines 4-5 in Listing 1 do this. The fact that this is restricted to func_adl's EventDataset is an artifact of the current implementation (see Section 5). The xaod_table is directly derived from a dataframe_expressions DataFrame.
Until the make_local function call in line 10, everything is handled by the dataframe_expressions package: no actual execution is scheduled. The make_local triggers the execution. The hep_tables package uses utilities in the dataframe_expressions package to build the full AST that summarizes the calculation. It then uses a plug-in system to create a plan for execution.
Currently a ServiceX and an awkward_array plug-in exist. Each plug-in can look at a single level in the AST and decide if it can execute it (given that it can execute all parents). hep_tables starts at the deepest parts of the tree and finds a backend that can execute each node of the AST, doing its best to use the executor that could handle all the parents. When the executor switches, hep_tables creates a python async task to perform the execution via the selected plug-in. The plug-ins are expected to render awkward_array arrays, and the different plug-ins use these to move data between them. Figure 1 shows the AST that is produced by dataframe_expressions starting from the xaod_table defined by the hep_tables package (see Listing 1). Each operation or element is a new DataFrame. hep_tables has walked the tree, labeling each node with a plug-in. In this example, all the nodes could be executed by the ServiceX plug-in; however, due to limitations in the prototype backend, they are not. When hep_tables gets to the bottom of the tree (the filter and division), it will submit to ServiceX a request to render the data from the & and .pt nodes, using the awkward_array backend to combine them.
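The bottom-up labeling pass can be sketched as follows. The Node and Backend classes and the operation names are hypothetical stand-ins for the actual hep_tables machinery:

```python
class Backend:
    """Toy plug-in: can execute a fixed set of operations."""
    def __init__(self, name, ops):
        self.name, self.ops = name, set(ops)
    def can_execute(self, node):
        return node.op in self.ops

class Node:
    """Toy AST node with a backend label filled in by the walk."""
    def __init__(self, op, children=()):
        self.op, self.children, self.backend = op, list(children), None

def label(node, backends):
    for c in node.children:
        label(c, backends)                     # deepest nodes first
    for b in backends:
        # Prefer a backend that also handled all of this node's children.
        if b.can_execute(node) and all(c.backend is b for c in node.children):
            node.backend = b
            return
    # Backend switch: this is where hep_tables would hand results across
    # (as awkward arrays) via an async task.
    node.backend = next(b for b in backends if b.can_execute(node))

servicex = Backend("servicex", {"select", "filter"})
awkward = Backend("awkward", {"select", "filter", "divide"})
tree = Node("divide", [Node("filter", [Node("select")])])
label(tree, [servicex, awkward])
print(tree.backend.name)   # 'awkward'
```

In this toy run the select and filter nodes stay on the servicex backend, and the division at the top forces a switch to the awkward backend, mirroring the split described for Figure 1.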

ServiceX
The ServiceX plug-in works by translating the AST tree to func_adl, and submitting the query via the ServiceX web API. The result is returned as awkward_array arrays.
Translation to func_adl for an expression tree like the one above requires some finesse. The tree contains multiple references to the electron branch, and those that are separated by a distance - the .pt and the [filter] nodes - require an extra carry-along in the func_adl expression. While func_adl has no trouble handling this, the first version of the translator cannot; though the AST in Figure 1 could be run entirely in func_adl, it is not.
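The shape of the translation target can be illustrated with a stub that records a func_adl-style chain. The method names mimic func_adl's Select/Where style, but the Query class and the query strings are invented for illustration, not the real func_adl API:

```python
class Query:
    """Stub that records a chain of named steps (not the real func_adl API)."""
    def __init__(self, steps=()):
        self.steps = list(steps)
    def __getattr__(self, name):
        # Query().Select(...) appends ('Select', ...) to the recorded chain.
        return lambda arg: Query(self.steps + [(name, arg)])

# Roughly the electron-pt query of Listing 1, as a recorded chain:
q = (Query()
     .Select("lambda e: e.Electrons('Electrons')")
     .Where("lambda eles: ...")      # the carry-along filter step
     .Select("lambda eles: eles.Select(lambda e: e.pt())"))
print([name for name, _ in q.steps])   # ['Select', 'Where', 'Select']
```

The translator's job is to flatten the recorded dataframe AST into one such linear chain, threading the shared electron reference through every step.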

awkward_array
The awkward_array plug-in is an immediate translator: it executes each AST node, one at a time, and does not try to take advantage of operation fusing. In that sense it is a direct translation of awkward operations (which also made it very easy to write).
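An immediate, node-at-a-time executor can be sketched like this, with python lists standing in for awkward arrays; the tuple node encoding is invented for illustration:

```python
# Evaluate each recorded node as it is visited: no fusing, one operation
# per node, which is the awkward_array plug-in's strategy conceptually.
def execute(node, env):
    kind = node[0]
    if kind == "source":                      # leaf: look up a column
        return env[node[1]]
    if kind == "div":                         # one array op per node
        return [x / node[2] for x in execute(node[1], env)]
    raise ValueError(f"unknown node {kind!r}")

# jets.pt / 1000.0 as a tiny two-node plan:
plan = ("div", ("source", "pt"), 1000.0)
print(execute(plan, {"pt": [50000.0, 70000.0]}))   # [50.0, 70.0]
```

Each node produces a fully materialized intermediate, which is simple to write but gives up the fusion opportunities described in Section 2.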

Capabilities
The capabilities of hep_tables are, to first order, defined by the backend plug-ins. This section notes a few capabilities that are implemented:
• map: The map function is implemented (see Section 2.2). It takes whatever is to its left as a collection and loops over it. The exact operation depends on what is to the left and on the lambda function, of course.
• Backend functions: There are often utility functions declared in the backend. hep_tables needs to know about the existence of these, so they must be declared up front. The DeltaR function is an example of this. Any function can be declared, and there is the possibility of shipping arbitrary C++ down to run on the xAOD/ServiceX backend as well, though that is not currently implemented.
• First, Count, and similar functions: a.First() returns the first element of an array, and a.Count() returns the number of items in an array. Both of these turn an array into a scalar. Various other aggregation functions are also supported.
• Types are tracked through the expression, though heuristics are used at the start. For example, collections like Electrons are hard-coded into an xAOD data model, but once a collection is accessed, its type is tracked. The leaves, like .pt, are assumed to be doubles. This works out mostly, but not always.

Associated Objects Example
This section works through an example that selects reconstructed electrons from an xAOD file and plots the matching Monte Carlo electron. A match is defined as an MC electron within ∆R < 0.1. The example demonstrates a number of the concepts seen above, and illustrates some of the pain points. The resulting plot from line 49 is shown in Figure 2. Some of the code is familiar from Listing 1. Lines 7-10 declare a backend function; in this particular case it will be executed on ServiceX. It is used in the function dr defined at line 28. Lines 24 and 25 define the good MC and data electrons used in the comparison. The function associate_particles defined at line 27 associates the source (electrons) to the pick_from (MC electrons). This is done using the very_near function defined at line 32, which returns a list of MC electrons that are within ∆R < 0.1. Note that the lambda capture grabs the electron in the form of the variable p that is passed in. Not all electrons have a match, hence the has_match defined at line 38, which is used to protect the matched-particle definition in line 40. Note that lines 38-40 extend the data model so data electrons now have the bool has_match and a Monte Carlo particle mc. These can be referred to, as they are in line 46, for easy difference plotting.

Status
A prototype exists on GitHub and is working. This prototype was used to develop the concepts in the code. It dispatches code to run against ServiceX and awkward as mentioned earlier. There are, as described, clear short-comings with the code. At the time of writing, a new version is being created to fix a number of the architectural mistakes that were made during prototyping.
• The ServiceX backend is being upgraded to handle the full range of supported func_adl queries to reduce the amount of data that needs to be shipped from ServiceX to an analysis cluster.
• A backend that supports the coffea library is being created. This includes the ability to run processors in multiple places and support for the coffea data model (called NanoAOD). One of the biggest benefits here will be the ability to run on clusters. This will replace the awkward_array plug-in, as awkward is used by coffea.
• A full type system is being implemented. With types known, the system will not have to infer them, reducing mistakes. It will also mean errors are flagged earlier (e.g. accessing data that doesn't exist).
• As the user's intent is known at the start, it is possible to cache the user's query. Thus the second time the same query is run it should take very little time - just a lookup. This is currently implemented by the ServiceX backend. We are investigating adding this more generically to the system.
• Histogram definition and filling is being implemented as something understood by dataframe_expressions. This will allow histograms (or other similar objects) to be filled in parallel and then combined.
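The query-caching item above can be sketched by keying a cache on a hash of the fully rendered query. The hashing scheme here is an invented illustration, not the actual ServiceX cache key:

```python
import hashlib

_cache = {}

def run_query(query_text, execute):
    # Key the cache on the rendered query text; an identical second
    # request becomes a dictionary lookup instead of a remote render.
    key = hashlib.sha256(query_text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = execute(query_text)
    return _cache[key]

calls = []
def fake_execute(q):
    calls.append(q)          # stands in for the expensive ServiceX round trip
    return [50.0, 70.0]

r1 = run_query("jets.pt / 1000.0", fake_execute)
r2 = run_query("jets.pt / 1000.0", fake_execute)
print(r1 == r2, len(calls))   # True 1
```

Because the whole query is captured up front, the cache key covers the user's complete intent, so a hit is safe without re-contacting the backend.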
The hep_tables approach, like many in python, does have some issues. Some are a function of this implementation and some are a function of python's programming model.
• Assumptions are made about which level an array operation occurs. For example, does a.jets.Count() count the number of jets in each event (returning an array), or count the number of events (returning a scalar)?
• While lambda capture enables many important patterns and makes it clearer what is going on, its readability is not the most straightforward. This is a generic problem with python: it is not yet possible to robustly analyze the source code of a python lambda programmatically. The double-loop example in Section 2.1 illustrates this.
• It isn't obvious how functions over sequences should work when more than one sequence is involved. For example, abs(a.jets.pt) should return an array of the absolute value of jet pt's. But what about the tuple (abs(a.jets.pt), abs(a.jets.pt))? Should that return a single array, with two entries in each? Or should it return a 2D array?
• Many common libraries, like seaborn, are used for plotting, and this approach is temptingly close to being able to use them directly. However, these libraries usually require the data in memory, and that can be quite expensive for a large analysis dataset.
• Loop algorithms are not well suited to this style of declarative programming. For example, track finding -where you don't have a definite number of steps.
• Control logic is not captured - if a decision has to be made on the result of a query, the query must be rendered first. As one thinks forward to adopting differentiable programming, as in the JAX package [10], this problem will have to be better understood.
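The Count() axis ambiguity in the first bullet above can be made concrete with nested python lists standing in for a jagged events-to-jets array:

```python
# Two defensible readings of a.jets.Count() on a jagged array:
events = [["j1", "j2"], [], ["j3"]]               # three events: 2, 0, 1 jets

jets_per_event = [len(jets) for jets in events]   # inner axis: [2, 0, 1]
total_events = len(events)                        # outer axis: 3
print(jets_per_event, total_events)               # [2, 0, 1] 3
```

A declarative front-end has to pick one of these meanings by convention, since the expression itself does not say which axis to aggregate over.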

Conclusions
This paper has described the dataframe_expressions and hep_tables packages. The former records the user's intent by tracking common and extended array operations. The latter renders the actions across multiple backends. This project was started as a prototype to understand whether something like this was possible without having to re-write the complete eco-system. As both numpy and awkward_array use a dispatch mechanism, this turns out to be possible.