Recent advances in ADL, CutLang and adl2tnm

This paper presents an overview of the features of an Analysis Description Language (ADL) designed for HEP data analysis. ADL is a domain-specific, declarative language that describes the physics content of an analysis in a standard and unambiguous way, independent of any computing framework. The paper also describes the infrastructures that render ADL executable, namely CutLang, a direct runtime interpreter (originally also a language), and adl2tnm, a transpiler converting ADL into C++ code. In ADL, analyses are described in human-readable plain text files, clearly separating object, variable and event selection definitions in blocks, with a syntax that includes mathematical and logical operations, comparison and optimisation operators, reducers, four-vector algebra and commonly used functions. Recent studies demonstrate that adopting the ADL approach has numerous benefits for the experimental and phenomenological HEP communities. These include facilitating the abstraction, design, optimization, visualization, validation, combination, reproduction, interpretation and overall communication of the analysis contents, as well as the long-term preservation of analyses beyond the lifetimes of experiments. Here we also discuss some of the current ADL applications in physics studies and future prospects based on static analysis and differentiable programming.


Introduction
High energy physics (HEP) experiments are collecting unprecedented amounts of data. In order to explore these data for hints of new physics, or to perform high precision measurements, physicists are designing an ever growing number of elaborate analyses. The physics content of these analyses consists of defining objects and variables used for classifying events as signal or background, selecting events, re-weighting simulated events to improve their agreement with real events, estimating backgrounds, and interpreting experimental results by comparing them to theory predictions. These tasks are traditionally performed using analysis software frameworks that organize the tasks into a computational pipeline. The frameworks integrate a diverse set of operations from data access to event selection, from histogramming to statistical analysis. These frameworks are written in general purpose languages (GPLs).
Section 3 will introduce the CutLang runtime interpreter and framework, along with language enhancements required for this approach, while Section 4 will introduce adl2tnm. Section 5 will summarize the current physics implementations and uses. Section 6 will introduce prospects for static analysis and differentiable programming, followed by the conclusions in Section 7.

ADL overview: File and functions
ADL hosts the physics content of an analysis in a plain, easy-to-read text file called the ADL file. The ADL file consists of blocks containing one or more keyword-value/expression structures:
    blockkeyword blockname
      keyword1 expression1
      keyword2 expression2
      keyword3 expression3 # comment
Blocks separate analysis components into semantically clear concepts such as object, variable and event selection definitions. Keywords specify HEP analysis concepts and operations such as selection, weighting, binning, etc. For example, the select keyword used for object or event selection is followed by a value resulting from an arbitrarily intricate boolean expression. Tables 1 and 2 list the blocks and keywords currently recognized in ADL. The syntax includes mathematical and logical operations, comparison and optimisation operators, reducers and four-vector algebra. It also includes some standard, general functions.
Sometimes analyses contain variables with complex algorithms that are non-trivial to express with the ADL syntax (e.g. M_T2, razor, aplanarity, etc.) or non-analytic variables (e.g. object or trigger efficiency tables, machine learning models, etc.). ADL handles these variables systematically by having them encapsulated in self-contained, external, standalone functions that can be referenced from within an ADL file. Throughout the ADL file, mass, energy and momentum are all written in units of GeV and angles in radians. User comments and explanations are preceded by a hash (#) sign. An example ADL file for a CMS analysis (CMS-SUS-16-037) is shown in Figure 2.
ADL, in its current state, can express many standard physics tasks such as object selections based on features, basic object reconstructions, variable definitions, event selections, event weighting, etc. However, it still has some missing features. For example, ADL has no generic way to describe arbitrary combinations of objects to form new ones (e.g., the reconstruction of all possible top quark candidates from the boosted or resolved decay modes). The prototype cannot describe low-level objects (e.g. hits, cells), or non-standard objects like long-lived particles (e.g. disappearing tracks, displaced muons, etc.). There is as yet no way to add new object attributes or define object associations (e.g., between a jet and its constituent particles or a track and its associated hits). Moreover, ADL needs to be extended with syntax to specify and apply systematic uncertainties. Work is ongoing to identify and incorporate these features and evolve ADL into a domain complete language.
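The block and keyword-expression layout described in this section can be illustrated with a small Python sketch. It is only a toy reader, not the ADL grammar and not part of CutLang or adl2tnm: the function name parse_blocks, the reduced set of block keywords and the sample text (including the block names and the take and size expressions) are assumptions made for illustration.

```python
# Toy reader for a block / keyword-expression layout (illustrative only).
BLOCK_KEYWORDS = {"object", "region"}  # assumed subset, for illustration


def parse_blocks(text):
    """Return a list of (block_keyword, block_name, [(keyword, expression), ...])."""
    blocks = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments after '#'
        if not line:
            continue
        head, _, rest = line.partition(" ")
        if head in BLOCK_KEYWORDS:
            blocks.append((head, rest.strip(), []))
        elif blocks:
            blocks[-1][2].append((head, rest.strip()))
        else:
            raise ValueError(f"statement outside any block: {line!r}")
    return blocks


# Hypothetical sample text, not a validated ADL file.
SAMPLE = """
object goodjets        # hypothetical block name
  take jets
  select pt > 30       # GeV
region signal
  select size(goodjets) >= 2
"""

parsed = parse_blocks(SAMPLE)
```

Each block ends up as a tuple holding its block keyword, its name and the ordered list of keyword-expression pairs, which is the minimum structure an interpreter or transpiler would need before interpreting the expressions themselves.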

The CutLang interpreter and framework
Runtime interpretation is a very practical approach that allows instant modifications in an analysis, such as adding new variables or selection criteria, changing the execution order or cancelling analysis steps, and avoids the modify-compile-run cycle. Not having to compile ADL content into a framework automatically provides the flexibility to run multiple analyses in parallel. The CutLang runtime interpreter and framework were developed to demonstrate the feasibility of this approach. The CutLang runtime interpreter is a C++ program utilizing function pointer trees to represent the different operations used in event selection and other relevant functions such as filling histograms. In this approach, processing an event through a cutflow list becomes equivalent to traversing multiple expression trees of arbitrary complexity, such as the one shown in Figure 3. The physics objects to be used are given as arguments to these functions.
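The idea of evaluating a selection by traversing a tree of functions can be sketched in a few lines of Python, with callables standing in for the C++ function pointers. This is not the CutLang implementation: the Node class, the helper functions and the event dictionaries are hypothetical and only show the traversal pattern.

```python
import operator


class Node:
    """One node of an expression tree: a callable plus child nodes."""

    def __init__(self, func, *children):
        self.func = func
        self.children = children

    def evaluate(self, event):
        # Evaluate the children first, then apply this node's function.
        args = [child.evaluate(event) for child in self.children]
        return self.func(event, *args)


def const(value):
    """Leaf node returning a fixed value."""
    return Node(lambda event: value)


def attr(name):
    """Leaf node reading one quantity from the event."""
    return Node(lambda event: event[name])


def binary(op, left, right):
    """Inner node combining two subtrees with a binary operator."""
    return Node(lambda event, a, b: op(a, b), left, right)


# Tree equivalent to the cut: (njets >= 2) and (met > 200)
cut = binary(
    operator.and_,
    binary(operator.ge, attr("njets"), const(2)),
    binary(operator.gt, attr("met"), const(200.0)),
)

# Made-up events, each a plain dictionary of event-level quantities.
events = [{"njets": 3, "met": 250.0}, {"njets": 1, "met": 400.0}]
passed = [cut.evaluate(e) for e in events]
```

Because the tree is built once and then only traversed, changing a cut means building a different tree at start-up rather than recompiling any code, which is the property the runtime approach relies on.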
Further functionalities such as the handling of Lorentz vector operations, pseudo-random number generation, and input-output file and histogram manipulations are all based on classes of the ROOT data analysis framework [15]. The ADL text itself is parsed in CutLang by automatically generated dictionaries and grammar based on the formal tools Lex and Yacc [16]. The ADL file is split into tokens by Lex, and the hierarchical structure of the algorithm is found by Yacc. Since these tools are traditionally found in all Unix-like systems, CutLang can be compiled and operated on a multitude of modern operating systems. The interpreter is compiled only during installation or when an external user function is added. Once the work environment is prepared, the remaining work consists mostly of thinking, editing, running and observing. Multiple input data formats are implemented as plug-ins into the CutLang framework. Some of the event types that are recognized and can be directly used are ATLAS and CMS open data [17], CMS NanoAOD [18], Delphes [19] and LHCO. CutLang also has its own internal format called LVL0. New input file types can also be added easily: an abstraction layer defining all particle types and event properties decouples internal data from input data formats. The only requirement on the input files is to use the ROOT file format. If CutLang does not provide by default the necessary methods to access some information (such as an attribute of a particle) in a particular input data type, that information can be accessed through external user functions.
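The tokenization step performed by Lex can be illustrated with a short Python sketch based on regular expressions. It is a stand-in for the idea only and does not reproduce the CutLang lexer: the token classes, the patterns and the sample line are assumptions.

```python
import re

# Assumed token classes; order matters because the first match wins.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r">=|<=|==|!=|[<>+\-*/()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(line):
    """Split one line into (token_class, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = TOKEN_RE.match(line, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {line[pos]!r} at column {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("select pt >= 30.5")
```

A grammar tool such as Yacc would then consume this flat token stream and recover the hierarchical structure, for example recognizing that the comparison is the argument of the select keyword.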
In the present design, achieving runtime interpretation inherently relies on the ADL file complying with a certain structure and content. For example, the CutLang runtime interpreter processes the commands in the ADL file on events from top to bottom. All information required at a stage, e.g. a variable name, must be available when CutLang arrives at that stage. Therefore, in order to be processed with CutLang, the description of the analysis content needs to be given in a well-defined order. According to this order, an ADL file starts with an initialization section containing commands related to analysis information and initialization. This is optionally followed by a counts section, used for recording already existing event counts and errors, e.g. from an experimental publication, if such counts are needed for statistical analysis. The next section is definitions1, used for defining aliases for objects and variables. This is followed by the objects section, which defines new objects based on predefined physics objects and the shorthand notations declared in definitions1. Next comes the definitions2 section, used for defining more objects and variables based on all the available objects. The current implementation permits only two definitions sections. The final part consists of the event categorization section, which defines event selection regions, the criteria in each region, event weighting and event histogramming. CutLang requires at least one selection region with at least one command, which may be either a selection criterion or a special instruction to include MC weight factors or to fill histograms.
CutLang also incorporates a complete analysis framework designed to run a full event analysis and output information and data that can be used for further study. The main output file, in the ROOT format, includes a copy of the ADL file content in order to record the provenance of the analysis. The output file contains a directory for each event categorization region, i.e. each region block. Under each directory, it stores histograms with the event counts and uncertainties obtained from the analysis, together with all histograms filled by the user. CutLang is also capable of saving the currently surviving events at any stage of the running algorithm. The events are saved into a dedicated user-defined ROOT [15] file (without the .root extension) using the command save. It is possible to save multiple times in a single algorithm (region) at different stages of the algorithm. The events in the output file are saved in the native LVL0 format of CutLang. The ROOT file also stores the saved events if this is declared at the ADL file level. It is possible to register various signal, background or data counts of a region together with their associated errors for studies such as phenomenological interpretation or validation. CutLang is also capable of multi-threaded execution of an analysis to make optimal use of the available resources. Adding the usual -j n option to the command that starts the analysis results in the use of n cores. The parameter n must be an integer between 0 and the total number of cores on the processor, where 0 stands for one less than the total number of cores, which maximizes performance while leaving part of the resources to the operating system. The parallelization choice in CutLang is to parallelize over the events, which are distributed equally over the available cores. A simple study, presented in Figure ??, shows that the optimal number of parallel processes should be equal to the number of physical cores.
Moreover, the same study showed that the multi-threading performance improves as the number of events in the analysis increases. This can be understood in terms of the file opening and closing overhead becoming unimportant as the total event processing time increases. As for the single-thread performance, the interpreter was shown to be about 20% slower than compiled code when used in a realistic analysis scenario [20].
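The convention for the -j n option and the equal distribution of events over cores can be sketched as follows. The text does not specify how CutLang partitions the events internally, so the contiguous-range scheme below, together with the function names resolve_workers and split_events, is an assumption made for illustration.

```python
def resolve_workers(n, total_cores):
    """Map the requested value to a worker count; 0 means 'all cores but one'."""
    if not 0 <= n <= total_cores:
        raise ValueError("n must be between 0 and the total number of cores")
    if n == 0:
        return max(1, total_cores - 1)
    return n


def split_events(n_events, n_workers):
    """Return contiguous [start, stop) ranges whose sizes differ by at most one."""
    base, extra = divmod(n_events, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        stop = start + base + (1 if worker < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


# Example: -j 0 on a hypothetical 8-core machine, 1000 events.
workers = resolve_workers(0, total_cores=8)
chunks = split_events(1000, workers)
```

With a fixed per-worker start-up cost, such as opening and closing files, the relative overhead shrinks as each range grows, which is consistent with the observation that performance improves with the number of events.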
CutLang currently includes all language features explicitly listed in Section 2. It was tested with various physics analyses, used in one published phenomenology study [21] and used as a training tool, as will be described in Section 5. Yet, improvements are still needed in two areas in order for it to be usable for full scale experimental analyses. CutLang does not yet have an automated mechanism to incorporate input data formats including a complete set of objects and methods. It also requires further automation in the incorporation of external user functions. The CutLang source code is publicly available on GitHub [22] (https://github.com/unelg/CutLang). Recently, the GitHub platform was used to set up continuous integration for automatic validation of the code via predefined test analyses.

adl2tnm transpiler for the TNM framework
The adl2tnm transpiler is a Python program that translates an ADL file to a C++ program that can be executed within the TNM (TheNtupleMaker) framework, an automated generic ntupling and analysis framework for CMS studies. Note, however, that the analysis component depends only on ROOT and not on any CMS data structures, therefore serving as a generic ntuple-based analysis framework. The workflow of adl2tnm is shown in Figure 5.
In principle, adl2tnm can work with any simple ntuple event format, e.g. Delphes [19], or ATLAS and CMS analysis ntuples such as CMS NanoAOD [18]. adl2tnm has an adapter mechanism capable of semi-automatically reading the input event format and incorporating it into the C++ code. adl2tnm operates by assuming the availability of a standard, extensible type for analysis objects, and has internally implemented such a type. Its adapter mechanism translates the input object types to the standard extensible types. The assumption of the standard extensible types is not an imposition on ADL itself, but rather an aid to the writing of transpilers and interpreters for ADL. The extensible type approach is intended as a generic solution to the reality that different input types can, and do, have different attributes and sometimes identical attributes with different names. For example, the transverse momentum of a particle may be called PT in Delphes, while the same attribute may be called Pt in other input types. Therefore, the extensible type used by adl2tnm uses the attribute names of the input types. The attributes are modeled as a map between a name (as a string) and a floating point value. adl2tnm produces analysis output in a ROOT file with content similar to that produced by CutLang.
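The extensible type, with attributes modeled as a map from a name to a floating point value, can be sketched in Python as follows. The actual adl2tnm type is generated C++ code that is not shown in this paper, so the class ExtensibleObject, the function adapt and the sample records are hypothetical.

```python
class ExtensibleObject:
    """Analysis object whose attributes live in a name -> float map."""

    def __init__(self, **attributes):
        self._attributes = {name: float(value) for name, value in attributes.items()}

    def __call__(self, name):
        # Look an attribute up by the name used in the input format.
        return self._attributes[name]

    def set(self, name, value):
        # New attributes can be attached at any time.
        self._attributes[name] = float(value)

    def names(self):
        return sorted(self._attributes)


def adapt(record):
    """Wrap one input record, keeping the attribute names of the input format."""
    return ExtensibleObject(**record)


delphes_like_jet = adapt({"PT": 52.3, "Eta": -1.1})  # Delphes-style name
other_jet = adapt({"Pt": 47.0, "Eta": 0.4})          # another format's name
delphes_like_jet.set("isTagged", 1)
```

Keeping the input names means that the analysis description refers to attributes exactly as they appear in the chosen input format, and that no fixed list of attributes has to be agreed on in advance.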
adl2tnm does not impose an order within the ADL file. The adl2tnm transpiler extracts all blocks from an ADL file and places them within a data structure that groups the blocks according to type. The blocks are then ordered according to their dependencies on other blocks.
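Ordering blocks according to their dependencies is a topological sort. A minimal sketch using the Python standard library is given below. The block names and the dependency map are made up for illustration, and the text does not state which algorithm adl2tnm uses internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical blocks, listed out of order; each maps to the blocks it uses.
dependencies = {
    "region signal": {"object cleanjets", "define ht"},
    "define ht": {"object cleanjets"},
    "object cleanjets": {"object jets"},
    "object jets": set(),
}

# static_order() yields every block after all the blocks it depends on.
ordered = list(TopologicalSorter(dependencies).static_order())
```

A circular dependency between blocks would make TopologicalSorter raise graphlib.CycleError, which is a natural place to report an ill-formed analysis description.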
The development of adl2tnm started during the initial phase of LHADA. Unlike CutLang, the transpiler is not based on formal tools such as Lex and Yacc. Though it was tested successfully on several analyses in comparison to CutLang, it still lacks the implementation of several ADL features. Work is in progress to rebuild adl2tnm in a more formal way through the use of formal grammar building and parsing tools. The current version is publicly available on GitHub [23] (https://github.com/hbprosper/adl2tnm).

Physics studies
Up to now, various analyses, mainly from LHC new physics searches, have been implemented with ADL. The primary goal of these implementations so far has been to determine the approximate range of physics content and to design ADL syntax that addresses this content. Implementing analyses with a variety of physics content led to incorporating a wider range of object and selection operations and helped to make the ADL syntax more generic and inclusive. Consequently, the scope and functionality of the CutLang interpreter, the adl2tnm transpiler and their frameworks were also significantly enhanced. These ADL analyses are being collected in a GitHub repository [24] (https://github.com/ADL4HEP/ADLLHCanalyses). The ADL implementations have been tested with CutLang and partially with adl2tnm. Some of them were also validated in comparison to other analysis frameworks in dedicated exercises performed during the Les Houches PhysTeV workshops [11] (Contribution 19). The phenomenology tool SModelS [25][26][27] decomposes a given new physics model into a set of simplified final states, and uses the experimental limits from various analyses on these simplified final states to obtain the sensitivity to the model. SModelS is adapting ADL and CutLang to compute the analysis selection efficiencies of the simplified model final states.
More recently, ADL and CutLang were used in a study estimating the sensitivity of the High Luminosity LHC and the Future Circular Collider to models with down-type isosinglet quarks [21]. Furthermore, an analysis example to run on CMS Open Data [17] was implemented. In addition, ADL and CutLang were used as main tools in an analysis school which took place in Istanbul in February 2020 for undergraduate students, where several analyses were implemented by the participating students [28]. ADL and CutLang were also employed in hands-on exercises for data analysis at the 26th Vietnam School of Physics (VSOP) in December 2020 [29], where the exercises were adapted to be performed via Jupyter notebooks [30]. The experience in both schools established ADL and CutLang as highly intuitive tools for introducing HEP data analysis to beginner level students.

Prospects for static analysis and differentiable programming
The formal, domain-specific, declarative syntax and the well-defined structure of ADL make it an ideal construct for implementing static analysis and differentiable programming. Moreover, having the analysis described in an independent text file decoupled from framework code greatly aids such tasks.
The act of parsing source code and deducing facts about it without actually running the code is called static analysis. Static analysis of a database of physics analyses implemented with ADL can be used to assist and automate queries among, or comparisons between, multiple analyses in the space of event properties. This helps to find out which event final states are covered or not, and which analyses have disjoint or overlapping selection regions. The practical features of ADL make such comparison tasks possible to some extent "by eye", even without formal static analysis. In any case, this information can in turn be used to combine multiple analyses or design original ones. Recently, we started to develop prototype tools for analysis queries and comparisons. The tools are designed to offer various options to perform these tasks, i.e. via static analysis, via physics events or via randomly sampled events. A more detailed description of these methods can be found in Contribution 17 of [11], and a preliminary version of the tools can be found in the repository [23].
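As a simplified illustration of deciding, without running over any events, whether two selection regions are disjoint, the sketch below represents each region by range cuts on named variables. This is not the prototype tool mentioned above: the representation, the function names and the example regions are assumptions, and selections built from arbitrary boolean expressions would need a more general treatment.

```python
import math


def intervals_overlap(a, b):
    """True if the half-open intervals [lo, hi) share at least one point."""
    return max(a[0], b[0]) < min(a[1], b[1])


def regions_overlap(region_a, region_b):
    """Regions are dicts mapping a variable name to its allowed [lo, hi) range.

    A variable missing from a region is unconstrained there, so only
    variables cut on by both regions can prove the regions disjoint.
    """
    shared = region_a.keys() & region_b.keys()
    return all(intervals_overlap(region_a[v], region_b[v]) for v in shared)


# Hypothetical regions with cuts on missing energy and multiplicities.
low_met = {"met": (100.0, 200.0), "njets": (2, math.inf)}
high_met = {"met": (200.0, math.inf), "njets": (2, math.inf)}
one_lepton = {"nleptons": (1, 2), "met": (150.0, math.inf)}

disjoint_pair = regions_overlap(low_met, high_met)
overlapping_pair = regions_overlap(low_met, one_lepton)
```

Two regions found to be disjoint in this way could be combined statistically without double counting events, which is one of the uses of such comparisons.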
The task of a HEP data analysis can be viewed as a mathematical function, which takes as arguments signal, background and observed events and various cross sections, and interfaces with a statistical tool providing a desired output, such as a measure of statistical significance for a sought signal or a measurement (e.g., of a cross section) and its associated uncertainty. The mapping from events to desired outputs is an optimization problem. For example, in the case of expected statistical significance, the goal is to maximize it. For a measurement, the goal may be to reduce the expected relative measurement uncertainty. Therefore, a HEP data analysis fits into the broad class of optimization problems whose solution is, in principle, amenable to gradient descent or ascent. Such problems may be effectively handled via differentiable programming, where analysis elements such as selection thresholds are treated as differentiable parameters. ADL is particularly suited to this approach as it uniquely and systematically organizes the description of these parameters. A dedicated effort has started in the HEP community towards building automatic differentiation tools to make analyses completely differentiable, and in particular towards developing differentiable analogues of non-differentiable operations such as binning and sorting that are common in HEP analyses [31]. ADL will be combined with these emerging tools to obtain differentiable analyses.
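Treating a selection threshold as a differentiable parameter can be illustrated with a toy example in plain Python. The hard cut is replaced by a sigmoid, a simple figure of merit Z = s / sqrt(s + b) is used as the objective, and the derivative is written out by hand. The choice of the sigmoid, the figure of merit, the step size and the event values are all assumptions made for illustration, and the sketch does not use the automatic differentiation tools of [31].

```python
import math


def soft_pass(x, threshold, width):
    """Sigmoid stand-in for the hard cut x > threshold."""
    return 1.0 / (1.0 + math.exp(-(x - threshold) / width))


def yields(values, threshold, width):
    """Soft event count and its derivative with respect to the threshold."""
    weights = [soft_pass(x, threshold, width) for x in values]
    count = sum(weights)
    dcount = sum(-w * (1.0 - w) / width for w in weights)
    return count, dcount


def significance_and_gradient(signal, background, threshold, width=5.0):
    """Z = s / sqrt(s + b) and dZ/dthreshold from the chain rule."""
    s, ds = yields(signal, threshold, width)
    b, db = yields(background, threshold, width)
    z = s / math.sqrt(s + b)
    dz = (ds * (0.5 * s + b) - 0.5 * s * db) / (s + b) ** 1.5
    return z, dz


# Toy values of one discriminating variable (made up for illustration).
signal = [60.0, 70.0, 80.0, 90.0, 100.0]
background = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]

threshold = 30.0
z_start, _ = significance_and_gradient(signal, background, threshold)
for _ in range(200):
    _, grad = significance_and_gradient(signal, background, threshold)
    threshold += 50.0 * grad  # gradient ascent step; the step size is arbitrary
z_final, _ = significance_and_gradient(signal, background, threshold)
```

In a realistic setting the derivative would come from an automatic differentiation library rather than a hand-written formula, and the threshold would be one of many parameters collected from the analysis description.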

Conclusions and outlook
In this paper, we presented the concept of, and recent developments in, a domain-specific, declarative and framework-independent Analysis Description Language for HEP analyses. We gave an overview of the current ADL syntax, which is already sufficient for describing the physics content of a large number of analyses. We then presented the two main tools developed to render ADL executable, namely the runtime interpreter CutLang and the transpiler adl2tnm. Both tools can already be used for processing various analyses on events and produce meaningful output that can be used in further statistical studies. Currently, CutLang supports a wider range of ADL features, while adl2tnm has a more automated way of handling input data formats and external functions. We also discussed the prospects of ADL for static analysis and differentiable programming and presented the existing and ongoing physics applications. All these studies demonstrate the feasibility, effectiveness and potential of ADL, and establish motivation to pursue this initiative and its diverse applications. Up-to-date information about ADL, CutLang, adl2tnm and various applications is systematically documented at the project's web portal cern.ch/adl [32]. Studies will continue towards developing ADL into a domain complete language, improving the functionality and robustness of CutLang and adl2tnm, building new tools that make use of ADL's potential and practicality, and exploring a large variety of physics applications.