Fast and resource-efficient Deep Neural Network on FPGA for the Phase-II Level-0 muon barrel trigger of the ATLAS experiment

The Level-0 muon trigger system of the ATLAS experiment will undergo a full upgrade for the High Luminosity LHC to meet the challenging requirements imposed by the increase in instantaneous luminosity. The upgraded trigger system will send raw hit data to off-detector processors, where trigger algorithms run on a new generation of FPGAs. To exploit the flexibility provided by the FPGA systems, ATLAS is developing novel precision deep neural network architectures based on trained ternary quantisation, optimised to run on FPGAs for efficient reconstruction and identification of muons in the ATLAS "Level-0" trigger. Physics performance in terms of efficiency and fake rates, and FPGA logic resource occupancy and timing obtained with the developed algorithms, are discussed.


Introduction
The high-luminosity phase of the Large Hadron Collider (HL-LHC) at CERN is expected to start operation in 2027 and to ultimately reach a peak instantaneous luminosity of $L = 7.5 \times 10^{34}$ cm$^{-2}$ s$^{-1}$, corresponding to approximately 200 inelastic proton-proton collisions per bunch crossing. It will deliver to the ATLAS experiment [1] more than ten times the total integrated luminosity collected in all previous LHC runs. These conditions pose significant challenges to the ATLAS trigger and DAQ systems if the physics potential of the machine is to be fully exploited. To handle the amount of data produced at peak luminosity by the HL-LHC, the detectors of the ATLAS experiment will undergo a full upgrade. In particular, the Level-0 muon barrel trigger will be improved by adding a Resistive Plate Chamber (RPC) station (BI chamber) at the innermost radius of the Muon Spectrometer [2], and by moving the trigger logic off-detector, where flexible algorithms run on latest-generation Field Programmable Gate Array (FPGA) processors [3]. Classification and regression methods based on modern Machine Learning (ML) are well suited to overcome the limitations in performance and flexibility of the conventional algorithms, and they are a promising and viable way to exploit the flexibility of FPGAs for real-time applications in the LHC detector triggers. In this work, we explore the implementation of deep convolutional neural networks in FPGAs based on trained ternary quantisation networks [4,5], optimised to cope with the tight requirements in terms of resource usage (O(30%)) and latency (O(1 µs)) imposed by the FPGA architecture and trigger constraints. We demonstrate that it is possible to reach state-of-the-art performance in muon reconstruction and identification in the ATLAS Level-0 muon trigger with microsecond latency.

Related work
ML inference on FPGAs has received increasing interest in recent years [6,7], and several studies have recently been presented in the context of high energy physics (see for example [8][9][10]). In this work, we draw inspiration from these and other studies but follow an approach more oriented towards achieving a robust and working implementation of a specific deep neural network architecture: a ternary convolutional neural network with performance matching the requirements of the ATLAS Phase-II Level-0 muon trigger.

Conventional RPC-based trigger algorithm
A conventional Phase-II RPC-based trigger algorithm ("Standard Algorithm" in the following) has been implemented in the ATLAS simulation [3]. It is a direct extension of the algorithm used before the upgrade, with the addition of the BI RPC station (see Figure 1). It operates as a pattern-finding algorithm, as illustrated in Figure 2. For each hit found in a predefined detector layer (usually the innermost layer), a coincidence window is opened toward the adjacent layer. The size of the window is inversely proportional to the transverse momentum ($p_T$) threshold of the trigger, so that if the muon $p_T$ is not high enough, the magnetic field will bend the particle outside the coincidence window of the next layer. If a new hit is found, the process is repeated recursively until all the layers are analysed. If the number of hits is greater than a given threshold, the event is triggered. To compare performance with the deep neural network algorithm, the configuration requiring hits in at least three out of four RPC stations is used. The Standard Algorithm is reliable and fast. However, some limitations in terms of robustness are observed. The geometrical acceptance of the coincidence window and the configuration logic set an upper limit on the maximum efficiency. Moreover, the algorithm effectively measures the muon momentum from the deflection of the trajectory with respect to a straight line from the interaction point, which limits the possibility of triggering with high efficiency on neutral long-lived particles decaying into muons.
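As an illustration only, the pattern-finding logic described above can be sketched in a few lines of Python. The function name `standard_trigger`, the representation of hits as integer strip indices, and the constant `window` half-width are assumptions made for this sketch and do not reproduce the ATLAS implementation. In particular, keeping the window centred on the last matched hit and following the closest candidate are simplifications.

```python
def standard_trigger(hits_by_station, window, min_hits=3):
    """Return True if a road of hits spans at least `min_hits` stations.

    hits_by_station: list of lists of strip indices, seed station first
                     (hypothetical data layout).
    window: coincidence half-width in strips; a narrower window
            corresponds to a higher pT threshold.
    """
    for seed in hits_by_station[0]:
        current = seed      # centre of the coincidence window
        n_hits = 1          # the seed itself counts as one hit
        for station in hits_by_station[1:]:
            in_window = [h for h in station if abs(h - current) <= window]
            if not in_window:
                continue    # station missed; keep the previous centre
            # Simplifying assumption: follow the hit closest to the centre.
            current = min(in_window, key=lambda h: abs(h - current))
            n_hits += 1
        if n_hits >= min_hits:
            return True
    return False
```

With `window=3`, a nearly straight pattern such as `[[100], [102], [104], [107]]` fires the trigger, while a strongly bent one such as `[[100], [110], [125], [140]]` does not, which is the behaviour expected for a low-$p_T$ muon leaving the coincidence window.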

Deep Neural Network approach
To overcome the limitations of the Standard Algorithm, we have used an ML-based approach built on the implementation of a Convolutional Neural Network (CNN) on an FPGA. CNNs are a regularised form of multilayer perceptron, well known to excel in classification and regression tasks on visual imagery. As shown in Figure 3 [11], RPC detector strips can be arranged in image-like objects, to be fed to the CNN as training inputs. Each bin of the vertical axis corresponds to a detector layer (3 detector layers for the inner station, 4 for the middle and 2 for the outer station). The horizontal axis maps the η coordinate of each physical RPC strip: for the i-th strip, $\eta^{i}_{\mathrm{bin}} = 384\,\frac{\eta_i - \eta_{\min}}{\eta_{\max} - \eta_{\min}}$, where $\eta_{\max} = 0.95$ and $\eta_{\min} = 0.07$ are respectively the maximum and minimum η values for the barrel RPC strips, chosen to prevent muons from falling outside any layer of a specified sector, and 384 is a realistic number of strips per layer. This provides a convenient representation of the RPC hit data, in which an infinite-momentum muon appears in the image as a vertical pattern of pixels, independently of the pseudorapidity η, while lower-momentum muons ideally appear as inclined pixel patterns with slopes inversely proportional to the muon $p_T$.
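The mapping from strip η to image column can be sketched as follows. The constants come from the text above; the function names, the truncation to an integer bin, and the clamping of out-of-range values to the first or last column are assumptions made for illustration.

```python
ETA_MIN, ETA_MAX = 0.07, 0.95   # barrel RPC strip range quoted in the text
N_BINS = 384                    # strips (columns) per layer
N_LAYERS = 9                    # 3 inner + 4 middle + 2 outer layers

def eta_to_bin(eta):
    """Column index for a strip at pseudorapidity `eta`."""
    frac = (eta - ETA_MIN) / (ETA_MAX - ETA_MIN)
    # Assumption: truncate to an integer and clamp to the valid range.
    return min(max(int(N_BINS * frac), 0), N_BINS - 1)

def build_image(hits):
    """Build a binary N_LAYERS x N_BINS image from (layer, eta) hits."""
    image = [[0] * N_BINS for _ in range(N_LAYERS)]
    for layer, eta in hits:
        image[layer][eta_to_bin(eta)] = 1
    return image
```

In this representation, hits with the same η in all nine layers land in a single column, which corresponds to the vertical pattern expected for an infinite-momentum muon.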
Figure 3. Examples of RPC event images used to train the CNN [11]: (left) an event with one low-$p_T$ muon ($p_T$ = 4 GeV); (center) an event with two high-$p_T$ muons (15 and 12 GeV respectively); (right) an event with three muons and background noise due to pileup and cavern background. RPC hits of a fixed sector are arranged in a matrix-like object, with the η range chosen so that the performance of the ML algorithm can be evaluated without any geometrical acceptance effect. Events with more than one muon are built by superimposing single-muon images with no background, which is then added. The CNN output evaluates the transverse momentum and η of the leading and sub-leading muons (if the latter exists) in the sector, and also returns a flag for events that contain more than two muons.

Training data for the CNN is based on detailed Monte Carlo simulation of the ATLAS Phase-II detector, including realistic geometry, resolution effects, and cavern background evaluated from minimum bias events at HL-LHC peak luminosity conditions. Events with multiple muons are built by combining multiple single-muon events with the cavern background. A total of one million images with muons in the range 3-20 GeV $p_T$ has been used, divided between training, validation and testing sets.
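The construction of multi-muon training events described above can be sketched as a pixel-wise OR of single-muon images followed by the addition of random background hits. The function names, the uniform per-pixel `hit_probability`, and the zero padding of the target when fewer than two muons are present are assumptions of this sketch, not details taken from the ATLAS simulation.

```python
import random

N_LAYERS, N_BINS = 9, 384

def superimpose(images):
    """Pixel-wise OR of single-muon images (lists of 0/1 rows)."""
    return [[int(any(img[r][c] for img in images)) for c in range(N_BINS)]
            for r in range(N_LAYERS)]

def add_background(image, hit_probability, rng):
    """Switch on each empty pixel independently with `hit_probability`."""
    return [[1 if pix or rng.random() < hit_probability else 0 for pix in row]
            for row in image]

def make_target(muons):
    """Build (pT_lead, eta_lead, pT_sub, eta_sub, n_muons) from (pT, eta) pairs."""
    ordered = sorted(muons, key=lambda m: m[0], reverse=True)[:2]
    # Assumption: missing muons are encoded as zeros.
    padded = ordered + [(0.0, 0.0)] * (2 - len(ordered))
    (pt_lead, eta_lead), (pt_sub, eta_sub) = padded
    return (pt_lead, eta_lead, pt_sub, eta_sub, len(muons))
```

A seeded generator such as `random.Random(0)` can be passed as `rng` to make the background reproducible.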
Two CNN models have been trained in order to assess the performance of the new algorithm. The first is a benchmark network based on a state-of-the-art floating-point CNN implementation with a simplified VGG architecture [12]. The second is a ternary CNN (tCNN) [4,5], which addresses the limited storage and computational resources imposed by the use of an FPGA. It is realised by constraining the weights and activations of the network to be ternary-valued: +1, 0 and -1. The weights and neuron outputs in a tCNN can therefore be represented using just two bits each instead of the 32 bits needed for a floating-point CNN, resulting in a 16-fold reduction in memory occupancy and a simpler implementation in the FPGA firmware with better performance in terms of latency.
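The following sketch illustrates why ternary values are cheap to store and to compute with. A fixed symmetric `threshold` stands in for the trained quantisation of [4,5], and the 2-bit encoding and all function names are assumptions chosen for illustration.

```python
def ternarize(weights, threshold):
    """Map each float weight to -1, 0 or +1 using a symmetric threshold."""
    return [1 if w > threshold else -1 if w < -threshold else 0
            for w in weights]

# Assumed 2-bit two's-complement codes for the three allowed values.
_ENCODE = {0: 0b00, 1: 0b01, -1: 0b11}
_DECODE = {code: value for value, code in _ENCODE.items()}

def pack(ternary_weights):
    """Pack ternary values into one integer, two bits per value."""
    word = 0
    for i, t in enumerate(ternary_weights):
        word |= _ENCODE[t] << (2 * i)
    return word

def unpack(word, n):
    """Recover `n` ternary values from a packed integer."""
    return [_DECODE[(word >> (2 * i)) & 0b11] for i in range(n)]

def ternary_dot(ternary_weights, inputs):
    """Dot product using only additions and subtractions."""
    total = 0
    for t, x in zip(ternary_weights, inputs):
        if t == 1:
            total += x
        elif t == -1:
            total -= x
    return total
```

Sixteen such 2-bit values fit in the 32 bits occupied by a single floating-point weight, which is the origin of the factor 16 quoted above, and the dot product needs no multiplier.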
The neural network architecture is illustrated in Figure 4. Convolutional [13] and Max-Pooling layers [14] have (4,3) and (4,1) kernels respectively. The activation function for the hidden layers is ReLU [15] for the CNN and a deterministic ternary activation for the tCNN, while a sigmoid activation [16] is used in the output layer in order to describe continuous output values. Batch normalisation layers [17] with momentum are used in both the convolutional and fully connected stages. Both networks are trained to predict a five-component vector $(p_T^{\mathrm{lead}}, \eta^{\mathrm{lead}}, p_T^{\mathrm{sub-lead}}, \eta^{\mathrm{sub-lead}}, n_{\mathrm{muons}})$, where "lead" stands for leading (i.e. the muon with the highest $p_T$) and $n_{\mathrm{muons}}$ represents the number of muons in an image. The MSE [16] loss function is minimised using the Adam algorithm [18] with an initial learning rate of $10^{-3}$ and a minibatch size of 64.

Figure 4. Schematic view of the network architecture that has been adopted. The numbers beside "Conv 2D" represent the number of filters. The numbers beside "Dense" represent the number of neurons of that layer.

The performance of the models on test samples is shown in Figure 5 [11]. In Figure 5 (left), the trigger efficiency curves are reported. Cyan dots show the efficiency of the Standard Algorithm as a function of $p_T$, red squares the efficiency obtained with the reference benchmark CNN with floating-point weights, and blue triangles the efficiency obtained with the tCNN. The CNN always performs better than the Standard Algorithm (lower efficiency below the threshold and higher efficiency above the threshold). Similar results are obtained with the tCNN, which shows a reduced $p_T$ resolution, manifested by a slower rise in the efficiency curve around the nominal threshold. The performance of the network in terms of the number of muons identified by the tCNN versus the number of true muons reconstructed offline is reported in Figure 5 (right). Each column is normalised to unity. No trigger threshold is applied in calculating the table entries. By requiring a minimum $p_T$ of 10 GeV, the off-diagonal entries are further reduced. In particular, for the case where no muon is reconstructed in the detector and one muon is reconstructed by the network, the fraction decreases from 0.6% to 0.01%.
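A column-normalised table of predicted versus true muon multiplicity, of the kind described above, can be computed as in the sketch below. The function name and the convention that rows index the predicted count and columns the true count are assumptions; the text only states that each column is normalised to unity.

```python
def column_normalised_matrix(true_counts, predicted_counts, n_classes):
    """Return matrix[predicted][true], with each column summing to one."""
    matrix = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_counts, predicted_counts):
        matrix[p][t] += 1.0
    for col in range(n_classes):
        total = sum(matrix[row][col] for row in range(n_classes))
        if total > 0:   # leave empty columns at zero
            for row in range(n_classes):
                matrix[row][col] /= total
    return matrix
```

With this convention, an off-diagonal entry such as `matrix[1][0]` is the fraction of events with no true muon in which the network reports one muon.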

FPGA implementation
The implementation of the tCNN model in the FPGA (Xilinx Virtex UltraScale+ XCVU13P) has been accomplished in two sequential phases. First, we have translated the model from its original Python form into C++ code with a custom-made tool. During this phase, several techniques have been adopted to improve performance by tuning the trade-off between latency, throughput, and FPGA resource usage. In particular, extensive C++ code modularisation has been used to reduce FPGA resource usage, while loop pipelining and vector partitioning have been implemented for latency reduction.
Figure 6. Trigger efficiency as a function of the reconstructed $p_T$ for the implemented tCNN in comparison with the optimal tCNN and the Standard Algorithm [11].

In order to reduce the number of parameters of the neural network, the tCNN has been modified to process in parallel a predefined number of portions of the entire image. Passing a smaller input to the tCNN largely reduces the total number of multiplications and therefore memory occupancy and latency. For the final step, the translation from C++ code into VHDL code, we have used HLS, a tool developed by Xilinx [19]. FPGA latency and resource usage are reported in Table 1 [11].
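The idea of feeding portions of the image to the network can be sketched as a simple split along the η axis. The function name, the use of equal-width non-overlapping slices, and the example value of four portions are assumptions, since the text does not specify how the portions are defined.

```python
def split_image(image, n_portions):
    """Split a (layers x strips) image into equal-width column slices."""
    n_cols = len(image[0])
    if n_cols % n_portions != 0:
        raise ValueError("image width must be divisible by n_portions")
    width = n_cols // n_portions
    return [[row[k * width:(k + 1) * width] for row in image]
            for k in range(n_portions)]
```

For a 9 x 384 image and `n_portions=4`, each slice has 9 x 96 pixels, so every copy of the network sees an input four times smaller than the full image.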
Table 1 shows that the implemented tCNN reaches the latency goal of 1 µs with a limited resource occupancy of about 17%, to be compared with about 420 ns latency and 6% resource usage for the Standard Algorithm. However, these results have so far been obtained with a tCNN with a smaller number of parameters (about 1/10th of the optimal tCNN), which limits the expressive power and physics performance of the network. This is clearly visible in the efficiency curve reported in Figure 6 [11], where the implemented tCNN (purple triangles) shows a slightly lower plateau efficiency and, more importantly, a less steep turn-on with respect to the optimal tCNN. Nevertheless, the initial performance achieved is very promising: the implemented tCNN already has performance comparable to that of the Standard Algorithm and, given the reduced resource utilisation, there is ample room for optimisation of the FPGA code synthesis, work that is ongoing at the time of writing of these proceedings.

Conclusions
ML alternatives to conventional trigger algorithms have been studied for the Phase-II Level-0 muon barrel trigger of the ATLAS detector at the LHC. In particular, it has been shown that a deep neural network-based algorithm can be effectively implemented in the trigger FPGA, within the latency requirements of the ATLAS trigger, and with comparable or better performance with respect to the ATLAS Standard Algorithm. Work is ongoing on optimisation strategies and parameter tuning to synthesise the best-performing tCNN into the trigger FPGA.