Awkward Just-In-Time (JIT) Compilation: A Developer’s Experience



Introduction
The Awkward Array library [1] offers its users an array-oriented programming style in a dynamically typed language. Array-oriented calculations fit the data analysis workflow well, considering the large amounts of data that a physicist typically loads into an interactive environment. Moreover, code that performs one operation on all of the data is easy to write, so the user can inspect intermediate results as distributions and iterate. Imperative code, on the other hand, is better suited to the stage after interactive exploration, but performant imperative programming requires compilation. Loops may take longer to write, especially when they have to fit into a Just-In-Time (JIT) compiler and get the compile-time types right, but they are often self-explanatory and faster. It is not uncommon for user analysis code to be written in a traditional high-performance, statically typed language such as C++, and a statically typed language requires compilation.
JIT compilation makes it convenient to compile in an interactive Python environment. Several JIT-compilation techniques are used to achieve the desired acceleration. The Awkward Array functions JIT-compile a user's code into executable machine code. They use different techniques, but reuse parts of each other's implementations.
The techniques discussed in this article focus on integration with RDataFrame [3], cppyy [4], and Numba [5], particularly Numba on GPUs. They include converting Awkward Arrays to and from RDataFrame, the standalone cppyy integration, passing Awkward Arrays to and from Python functions compiled by Numba, passing Awkward Arrays to Python functions compiled for GPUs by Numba, and populating Awkward Arrays from C++ without any Python dependencies using a header-only library.

Awkward Arrays and RDataFrame
Awkward Arrays can be converted to RDataFrame columns, and RDataFrame columns can be converted to Awkward Arrays, as described in detail in [6] and [7].
The user benefits from faster execution of both the ROOT C++ functions and user-defined pure C++ functions.
In the conversion to RDataFrame, the handle to the generated Array view is a lightweight 40-byte C++ object allocated on the stack. The generated RDataSource takes pointers into the original array data via this view. Next, the column readers are generated based on the run-time type of the views. Finally, the readers are passed to a generated source derived from ROOT::RDF::RDataSource.
The ak.from_rdataframe function converts the selected columns to native Awkward Arrays. A templated C++ header-only implementation and dynamically generated C++ code are used to extract the columns' types and data.
Conversely, those who prefer C++, ROOT, and RDataFrame can convert their data into Awkward Arrays in order to leverage the tools available in the wider Scientific Python ecosystem.

Standalone cppyy
Like ROOT and RDataFrame, cppyy allows a user to write C++ code, JIT-compile it, and use it from Python. The C++ code can be included in Python from a C++ file or written directly as a Python string. Unlike ROOT, however, cppyy can be installed without the entire ROOT package, and Awkward Array's interface to it is more low-level: users can write C++ functions that operate on whole arrays, take multiple arrays, and return arrays.
ak.Array, the Python class for all Awkward Arrays, implements a magic method __cast_cpp__ that cppyy calls to determine the C++ type of the array. The Numba implementation [8] is reused here to generate a C++ array view on demand. The hashed type name of the generated ArrayView C++ class is registered as the cpp_type string attribute of the ak.Array class. cppyy maps the C++ class type, given as a string, to a Python type. The downside is that the user cannot redefine a function: each function must have a unique name. This is similar to the PyROOT interpreter, because an earlier version of cppyy is used to bind Python and C++ in PyROOT.
The user does not need to know what cpp_type is: it is generated on demand when the array needs to be passed to a C++ function. The cppyy version must be 3.1 or later. cppyy can construct an object of a previously declared type from its arguments. This new feature is not available in earlier versions. cppyy implements an implicit instantiation from a tuple returned by __cast_cpp__.

Awkward Arrays and Numba
Numba is a JIT compiler for functions written in Python syntax, and Awkward Arrays can be passed to and from Numba-compiled functions with an interface similar to that of cppyy. Numba can be used in contexts in which acceleration is needed but C++ is not: it allows users to write accelerated code in the same Python language as the rest of their code, albeit in a subset of that language (not all Python code can be compiled).
Numba infers the argument types at call time and generates optimized code based on this information. Numba also compiles separate specializations depending on the input types. The implementation that passes Awkward Arrays to and from Python functions compiled by Numba defines a numba_type property of ak.Array, which is the type of a generated C++ Array view, as discussed above. Awkward Arrays can be iterated over in Numba-compiled functions using zero-copy views.
Numba can also compile code for NVIDIA GPUs through LLVM's CUDA backend, and Awkward Arrays can be passed as arguments into such functions. Currently, the extension needs to be specified by the user in the function decorator (as extensions=[ak.numba.cuda]), but this requirement will be removed in a future version of Numba.

Header-only libraries
To create Awkward Arrays in C++ without any dependence on Python, we provide a separate set of header-only libraries that implement an awkward::layoutbuilder [9], which builds up an array by appending elements and then shares that array through basic C types (raw array buffers and a JSON-formatted string that conveys the structure).
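The append-then-share idea can be illustrated with a small pure-Python stand-in. This is a conceptual sketch, not the C++ header-only library: the class `ListOfFloatsBuilder`, its method names, and the buffer keys are hypothetical, and it handles only lists of float64 values.

```python
import json
from array import array


class ListOfFloatsBuilder:
    """Hypothetical stand-in for a builder of lists of float64 values."""

    def __init__(self):
        self._offsets = array("q", [0])  # int64 list boundaries, starting at 0
        self._data = array("d")          # float64 values of all lists, flattened

    def append(self, value):
        # Add one element to the list that is currently being filled.
        self._data.append(value)

    def end_list(self):
        # Close the current list by recording where it ends in the flat data.
        self._offsets.append(len(self._data))

    def __len__(self):
        return len(self._offsets) - 1

    def form(self):
        # JSON string describing the structure; the form keys name the buffers.
        return json.dumps({
            "class": "ListOffsetArray",
            "offsets": "i64",
            "content": {
                "class": "NumpyArray",
                "primitive": "float64",
                "form_key": "node1",
            },
            "form_key": "node0",
        })

    def to_buffers(self):
        # Raw bytes only, so the result does not depend on Python objects.
        return {
            "node0-offsets": self._offsets.tobytes(),
            "node1-data": self._data.tobytes(),
        }


builder = ListOfFloatsBuilder()
for sublist in [[1.1, 2.2, 3.3], [], [4.4, 5.5]]:
    for value in sublist:
        builder.append(value)
    builder.end_list()

form = builder.form()
buffers = builder.to_buffers()
```

A structure description, a length, and a set of named buffers are also the three arguments that `ak.from_buffers` takes on the Python side, which is why this representation is enough to hand an array over without sharing any Python objects.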