C++ Code Generation for Fast Inference of Deep Learning Models in ROOT/TMVA

We report the latest development in ROOT/TMVA: a new tool that takes trained ONNX deep learning models and emits C++ code that can be easily included and invoked for fast inference of the model, with minimal dependencies. We introduce SOFIE (System for Optimized Fast Inference code Emit), with examples of its interface and generated code. We discuss the recently expanded support for a variety of neural network operators, including convolutional and recurrent layers, as well as the integration with RDataFrame. We demonstrate the current performance of this framework with a set of benchmarks.


Introduction
Since 2005, as part of the ROOT Data Analysis Framework [1], the Toolkit for Multivariate Analysis (TMVA) [2] has provided an environment for the training and evaluation of a large variety of machine learning methods for data analysis in High Energy Physics and other scientific fields. For example, TMVA provides training and inference of boosted decision trees (BDTs), which have been a popular algorithm for classification and regression among high energy physicists, contributing even to the Higgs discovery in 2012 [3][4][5][6].
The emergence of contemporary neural network designs has revolutionized machine learning in recent years. With the rise of deep learning, software solutions from large technology companies, such as TensorFlow, MXNet, and PyTorch [7][8][9], appeared and gradually came to dominate the landscape. Nowadays, numerous high-energy physics workflows, such as CMSSW [10], have embraced and integrated these external technologies.
SOFIE [11], the latest development work in ROOT/TMVA, aims to provide a convenient solution for users to conduct inference of deep learning models trained by these frameworks in a C++ based production environment. It takes a trained ONNX [12] model as input and outputs C++ code that hard-codes the inference function in a header file. This function can therefore be readily included and run from any C++ project, the only dependency being a linear algebra library such as BLAS (Basic Linear Algebra Subprograms). Google Protocol Buffers (protobuf) is required for parsing the ONNX model file, but not for using the generated C++ code. This allows for easy integration of the generated code in other C++ based HEP software frameworks.
There has been plenty of effort in this area, including frameworks developed by technology companies, such as ONNX Runtime [13], as well as solutions developed by the high energy physics community, such as Lightweight Trained Neural Network (lwtnn) [14]. Aside from software that provides inference on CPUs and GPUs, inference tools that specialize in other hardware have been developed in recent years, such as hls4ml [15], which emits FPGA implementations of machine learning algorithms. Another approach is to provide inference as a service with integrated GPUs and FPGAs, as in SONIC [16]. Despite the many options available, many of these solutions still carry a large number of dependencies or come with a memory-heavy runtime. SOFIE provides a lightweight, plug-and-use alternative on this front.
In particular, SOFIE now accepts ONNX, Keras, PyTorch and ROOT models. We have also expanded the supported operators to include Conv, Pool, RNN, GRU, LSTM, BatchNorm and others. Note that SOFIE is designed to be modular, so that users can easily contribute to it by adding custom operators without having to understand the core of the framework. SOFIE is also designed to be thread-safe, so that multi-threaded support can be expanded substantially in the future.

Examples of Interface and Generated Code
SOFIE is easy to use. To generate code, the user first uses a parser to convert the neural network format into an RModel object. Different parsers are being developed to support the ONNX, ROOT, Keras and PyTorch formats to varying degrees. From the RModel object, the inference code can then be generated. When this generation step is run, a header file named "model.hxx" is produced. To run inference with the generated code, simply include the header and call the pre-defined function.
A closer look at the generated code reveals that SOFIE works by unrolling the logic of each operator into plain C++ code, which allows maximal compiler optimization. For example, the new inference engine parses a general transpose operator as described by the ONNX operator standard, with the permutation attribute set to [3,2,1,0] (i.e., it permutes a tensor of shape [1,2,3,4] to a new shape of [4,3,2,1]), and emits a plain C++ code snippet for it.
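To illustrate what unrolling an operator into plain C++ can look like, the following is a minimal hand-written sketch of the transpose example above, with the shapes and permutation baked in as constants. It is not the code SOFIE actually emits: the namespace `sketch_model`, the function `transpose` and the use of `std::array` are assumptions made for illustration only.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a generated header; all names are illustrative.
namespace sketch_model {

// Input shape [1,2,3,4], permutation [3,2,1,0] -> output shape [4,3,2,1].
constexpr std::size_t kSize = 1 * 2 * 3 * 4;

inline std::array<float, kSize> transpose(const std::array<float, kSize>& in) {
    std::array<float, kSize> out{};
    // Row-major strides, hard-coded because the shapes are known at
    // code-generation time:
    //   output shape [4,3,2,1] -> strides {6, 2, 1, 1}
    //   input  shape [1,2,3,4] -> strides {24, 12, 4, 1}
    for (std::size_t id = 0; id < kSize; ++id) {
        // Decompose the flat output index into its 4 coordinates.
        const std::size_t o0 = id / 6;        // runs over input axis 3
        const std::size_t o1 = (id % 6) / 2;  // runs over input axis 2
        const std::size_t o2 = id % 2;        // runs over input axis 1
        const std::size_t o3 = 0;             // input axis 0 has extent 1
        // Output axis k indexes input axis perm[k], so recombine the
        // coordinates with the input strides in permuted order.
        out[id] = in[o3 * 24 + o2 * 12 + o1 * 4 + o0 * 1];
    }
    return out;
}

}  // namespace sketch_model
```

Because every stride and extent is a compile-time constant, the compiler is free to simplify the index arithmetic, which is the kind of optimization opportunity the unrolled form is meant to expose.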

Integration with RDataFrame
As part of the modern analysis workflow in ROOT, users can easily carry out multi-threaded inference of their deep learning models on data in RDataFrame by wrapping the generated inference function in a functor.
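One common way to make such a functor safe for multi-threaded use is to give each processing slot its own inference session, so that no mutable buffers are shared between threads. The sketch below emulates that pattern in standard C++ only, without ROOT. `SketchSession`, `SlotFunctor` and `run_parallel` are hypothetical names, the three-weight "model" is a placeholder, and the strided loop over `std::thread` workers merely stands in for a multi-threaded event loop; it is an assumption about the general pattern, not the functor from the original text.

```cpp
#include <array>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a generated inference session (illustrative only).
struct SketchSession {
    std::vector<float> weights{0.5f, -1.0f, 2.0f};
    std::vector<float> scratch;  // mutable working buffer: must not be shared

    float infer(const float* x) {
        scratch.assign(weights.size(), 0.0f);
        float sum = 0.0f;
        for (std::size_t i = 0; i < weights.size(); ++i) {
            scratch[i] = weights[i] * x[i];
            sum += scratch[i];
        }
        return sum > 0.0f ? sum : 0.0f;  // ReLU on the single output
    }
};

// Functor owning one session per slot; the slot number selects the session.
class SlotFunctor {
public:
    explicit SlotFunctor(unsigned nslots) : sessions_(nslots) {}
    float operator()(unsigned slot, float a, float b, float c) {
        const float x[3] = {a, b, c};
        return sessions_[slot].infer(x);
    }
private:
    std::vector<SketchSession> sessions_;
};

// Emulates a multi-threaded event loop: worker s handles rows s, s+nslots, ...
inline std::vector<float> run_parallel(
        const std::vector<std::array<float, 3>>& rows, unsigned nslots) {
    SlotFunctor functor(nslots);
    std::vector<float> out(rows.size());
    std::vector<std::thread> workers;
    for (unsigned s = 0; s < nslots; ++s) {
        workers.emplace_back([&, s] {
            for (std::size_t i = s; i < rows.size(); i += nslots)
                out[i] = functor(s, rows[i][0], rows[i][1], rows[i][2]);
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

A callable with this shape, taking the slot number as its first argument followed by the column values, is what RDataFrame's `DefineSlot` expects, so a functor built along these lines could in principle be attached to a data frame as a new column.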

Performance Benchmark
The performance of SOFIE on linear (dense, fully connected) layers was previously demonstrated and presented at CHEP 2021 [17]. Here we reproduce those results on a fully connected network with 10 layers, each 50 neurons wide, using the ReLU activation function.
Our latest development work on SOFIE includes expanded support for convolutional neural networks. We performed inference on a convolutional network with filter size (5,5) and a channel number varying from 1 to 128, on input images of size (100,100). We also benchmarked a full-fledged CNN architecture, ResNet-18. A pattern similar to that of the linear layers emerges. For simpler models, such as a one-layer convolutional network, we achieve better performance than ONNX Runtime. For more complex models, with a larger number of layers and a more complicated flow of data through the network operators, our performance is slightly worse than, though comparable with, that of ONNX Runtime. The likely reason for this pattern is that ONNX Runtime maintains a running process during inference that allows for greater optimization, but at the cost of greater overhead. For larger, more complex models, this trade-off is more likely to pay off.
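For readers who want a feel for the size of the fully connected benchmark model described above, the following self-contained sketch builds a network of 10 dense layers of width 50 with ReLU and measures a mean per-event latency. It is only an illustration under stated assumptions: the weights are arbitrary placeholders, the matrix-vector products use naive loops rather than BLAS, and `DenseNet` and `mean_latency_us` are hypothetical names. It is not the benchmark harness used for the results reported here, and its timings are not comparable to them.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

constexpr std::size_t kWidth = 50;   // neurons per layer
constexpr std::size_t kLayers = 10;  // number of dense layers

// Hypothetical toy network with placeholder weights (illustrative only).
struct DenseNet {
    std::vector<float> w;  // kLayers matrices of kWidth x kWidth, row-major
    std::vector<float> b;  // kLayers bias vectors of length kWidth

    DenseNet() : w(kLayers * kWidth * kWidth), b(kLayers * kWidth, 0.01f) {
        // Deterministic placeholder values in [-0.03, 0.03].
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] = static_cast<float>(static_cast<int>(i % 7) - 3) * 0.01f;
    }

    // Forward pass: y = ReLU(W x + b), applied layer by layer.
    std::vector<float> infer(std::vector<float> x) const {
        std::vector<float> y(kWidth);
        for (std::size_t l = 0; l < kLayers; ++l) {
            for (std::size_t o = 0; o < kWidth; ++o) {
                float acc = b[l * kWidth + o];
                for (std::size_t i = 0; i < kWidth; ++i)
                    acc += w[(l * kWidth + o) * kWidth + i] * x[i];
                y[o] = acc > 0.0f ? acc : 0.0f;
            }
            x.swap(y);  // output of this layer becomes the next input
        }
        return x;
    }
};

// Mean wall-clock time per inference call, in microseconds.
inline double mean_latency_us(const DenseNet& net, std::size_t n_events) {
    const std::vector<float> x(kWidth, 1.0f);
    float sink = 0.0f;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t e = 0; e < n_events; ++e) sink += net.infer(x)[0];
    const auto t1 = std::chrono::steady_clock::now();
    volatile float keep = sink;  // keep the result so the loop is not elided
    (void)keep;
    return std::chrono::duration<double, std::micro>(t1 - t0).count() /
           static_cast<double>(n_events);
}
```

Averaging over many events, as `mean_latency_us` does, smooths out timer resolution and warm-up effects, which matters for a model this small.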

Conclusion and Future Work
We reported the latest development work on SOFIE, the inference code generation engine in ROOT/TMVA. Specifically, we reported on the expanded support for input model formats (ROOT, Keras, PyTorch) and operators (Conv, RNN, LSTM, GRU, BatchNorm…), as well as the integration with RDataFrame for convenient inference on data in modern ROOT data analysis workflows. The latest SOFIE has been included in ROOT 6.24 as an experimental feature.
As the next step in our development plan, we aim to invest greater effort in further optimizing inference speed, including a deeper investigation of operator-level and compiler optimizations. We would also like to expand operator support according to our users' feedback and demand. Finally, we intend to improve interoperability with other modern ROOT analysis facilities such as RDataFrame.