L0TP+: the Upgrade of the NA62 Level-0 Trigger Processor

The L0TP+ initiative aims at upgrading the FPGA-based Level-0 Trigger Processor (L0TP) of the NA62 experiment at CERN for the post-LS2 data taking, which is expected to happen at 100% of the design beam intensity, corresponding to about 3.3 × 10¹² protons per pulse on the beryllium target used to produce the kaon beam. Although tests performed at the end of 2018 showed that the L0TP system remains substantially robust even at full beam intensity, several reasons motivate such an upgrade: i) avoiding FPGA platform obsolescence, ii) making room for improvements in the firmware design by leveraging a more capable FPGA device, iii) adding new functionalities, iv) supporting the ×4 beam intensity increase foreseen in future experiment upgrades. We singled out the Xilinx Virtex UltraScale+ VCU118 development board as the ideal platform for the project. Seamless integration of L0TP+ into the current NA62 TDAQ system and exact matching of the L0TP functionalities represent the main requirements and focus of the project; nevertheless, the final design will include additional features, such as a PCIe RDMA engine to enable processing on CPU and GPU accelerators, and the partial reconfiguration of the trigger firmware starting from a high-level language description (C/C++). The latter capability is enabled by modern High Level Synthesis (HLS) tools, but to what extent this methodology can be applied to perform complex tasks in the L0 trigger, with its stringent latency requirements and the limits imposed by single-FPGA resources, is currently being investigated. As a test case for this scenario we considered the online reconstruction of RICH detector rings on an HLS-generated module, using a dedicated primitives data stream carrying PM hit IDs. Besides, the chosen platform exposes the wide I/O capabilities of the Virtex UltraScale+ FPGA, allowing for straightforward integration of primitive streams from additional sub-detectors in order to improve the performance of the trigger.
∗e-mail: alessandro.lonardo@roma1.infn.it
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).
EPJ Web of Conferences 245, 01017 (2020), CHEP 2019, https://doi.org/10.1051/epjconf/202024501017


The NA62 Experiment
The NA62 experiment [1] is located at the CERN Super Proton Synchrotron (SPS) accelerator. The ultimate goal of the experiment is the precise measurement of the ultra-rare decay K+ → π+νν̄, predicted in the Standard Model (SM) [2] with a branching ratio of

BR(K+ → π+νν̄) = (8.4 ± 1.0) × 10⁻¹¹    (1)

A high-intensity kaon beam is required to collect the statistics needed to reach an accuracy comparable to the theoretical one. Kaons are produced by the 400 GeV/c SPS proton beam impinging on a beryllium target. Secondary particles of 75 GeV/c momentum are selected in an unseparated beam composed of 6% kaons, with a total rate of 750 MHz. Only about 10% of the kaons decay in flight along the 65 m decay volume, allowing the detector to collect ∼ 4.5 × 10¹² decays per year.

A Čerenkov counter (KTAG) identifies the K+ and three stations of Si pixel detectors (GTK) trace the beam particles. Annular lead-glass calorimeters (LAV) surround the decay volume for large-angle photon detection. Four stations of straw chambers (STRAW) in vacuum measure the momentum and trajectory of the charged decay products. A RICH counter identifies the charged particles. Plastic scintillators (CHOD and NA48-CHOD) are used for timing and in the trigger chain. A liquid-krypton electromagnetic calorimeter (LKr) and small-angle calorimeters (IRC and SAC) ensure photon detection in the forward region. Hadron calorimeters (MUV1,2) and a plastic scintillator detector (MUV3) identify muons. Further details can be found in [1].

The NA62 experiment ran throughout 2016, 2017 and 2018. Data collected in 2016 and 2017, running between 40% and 70% of the maximum beam intensity, have been analyzed, achieving the best single-event sensitivity to date for the K+ → π+νν̄ decay. In total 3 signal-candidate events have been observed in the 2016 [3] and 2017 data, leading to a preliminary upper limit of BR(K+ → π+νν̄) < 1.85 × 10⁻¹⁰ at 90% Confidence Level [4].
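The beam figures quoted above can be cross-checked with a few lines of arithmetic. This is a rough sketch: the effective beam time per year is an assumption inferred from the quoted ∼ 4.5 × 10¹² decays/year, not a number stated in the text, and the signal yield ignores detector acceptance and efficiency.

```python
# Back-of-the-envelope check of the quoted beam figures.
# Effective beam seconds per year is inferred from the quoted
# 4.5e12 decays/year; it is an assumption, not a stated number.

BEAM_RATE_HZ = 750e6      # total secondary beam rate
KAON_FRACTION = 0.06      # 6% kaons in the unseparated beam
DECAY_FRACTION = 0.10     # ~10% decay in the 65 m decay volume
BR_SM = 8.4e-11           # SM branching ratio of K+ -> pi+ nu nubar

kaon_rate = BEAM_RATE_HZ * KAON_FRACTION      # ~45 MHz of kaons
decay_rate = kaon_rate * DECAY_FRACTION       # ~4.5 MHz of decays

effective_seconds = 4.5e12 / decay_rate       # ~1e6 s/year (inferred)
decays_per_year = decay_rate * effective_seconds

# SM signal decays per year, before acceptance and efficiency
signal_yield = decays_per_year * BR_SM

print(f"kaon rate: {kaon_rate/1e6:.1f} MHz")
print(f"decays/year: {decays_per_year:.2e}")
print(f"SM signal decays/year (no acceptance): {signal_yield:.0f}")
```

The result, a few hundred signal decays per year before acceptance, shows why both the high-intensity beam and a highly efficient trigger chain are essential.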
Analysis of data collected in 2018 is ongoing, as well as the preparation for the new data taking foreseen starting from 2021, when the experiment will run at full intensity.

The Trigger and Data Acquisition System
A high-performance two-level trigger and data acquisition system (TDAQ) has been developed to reduce the initial data rate from ∼ 10 MHz to the 100 kHz suitable for permanent recording. A factor 10 in data reduction is obtained with the first level (L0) [5], based on a hardware processing pipeline, the L0 Trigger Processor (L0TP) [6], implemented on an FPGA programmable logic device. The upper trigger level (L1) is implemented in software on a computer farm. Trigger primitives are produced in the same boards used for data readout [7] by some subdetectors (CHODs, MUV3, RICH, LKr, LAV) and then sent to the L0TP via Gigabit Ethernet (1GbE) links, using the UDP transport protocol. More complex trigger primitives can be produced by means of GPU processing in the early stages of the L0 chain [8]. The L0TP trigger decision, obtained by combining the trigger primitives after time alignment, is dispatched to all TDAQ boards with a maximum latency of 1 ms. The L0TP is implemented on the Terasic DE4 development board, which hosts an Altera Stratix-IV FPGA. The L1 algorithms run on the individual detectors' data, with a non-fixed total latency within the period of the SPS beam-delivery cycle (called burst, about 5 seconds long).
Table 1 shows the rate of primitives received by the L0TP. It was measured at secondary beam intensities between 0 and 600 MHz and extrapolated to the nominal value of 750 MHz, assuming a burst length of 4.3 seconds.
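The extrapolation to nominal intensity and the conversion to counts per burst can be sketched as simple scalings. Both the linear-scaling assumption and the example rate below are illustrative; they are not the measured values of Table 1:

```python
# Linear extrapolation of a primitive rate measured at reduced beam
# intensity to the nominal 750 MHz, and conversion to counts per burst.
# The linearity assumption and the example 10 MHz rate are illustrative.

NOMINAL_INTENSITY_MHZ = 750.0
BURST_LENGTH_S = 4.3  # burst length assumed in the text

def extrapolate_rate(rate_hz: float, measured_intensity_mhz: float) -> float:
    """Scale a primitive rate linearly to the nominal beam intensity."""
    return rate_hz * NOMINAL_INTENSITY_MHZ / measured_intensity_mhz

def primitives_per_burst(rate_hz: float) -> float:
    """Number of primitives the L0TP receives in one burst."""
    return rate_hz * BURST_LENGTH_S

rate = extrapolate_rate(10e6, 600.0)  # hypothetical 10 MHz at 600 MHz
print(f"{rate/1e6:.2f} MHz -> {primitives_per_burst(rate):.2e} per burst")
```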

L0TP+: the Upgrade of the NA62 Level-0 Trigger Processor
The NA62 L0TP system suffers from the shortcomings of the ten-year-old technology of the adopted platform. First of all, the Terasic DE4 has been discontinued, while Intel (which acquired Altera in 2015) is phasing out the Stratix-IV FPGA libraries. With the outlook of running until 2023, trigger maintenance must be guaranteed; furthermore, porting the design onto a more recent device will allow us to exploit the possibilities offered by the new technology, paving the way for migrating the system to follow the increases to come in the experiment luminosity. As an example, the current maximum output rate of 1 MHz is limited by the bandwidth of the 1GbE link used to send the trigger information to the computer farm, while the DAQ infrastructure is already structured to host a 10GbE connection for this purpose.

Another chance for improvement is in the firmware design, since the development of the current version was driven by the constraints imposed by the limited Altera Stratix-IV FPGA resources. As a matter of fact, in order to find coincidences between different detectors, the L0TP requires one of them as reference (usually the RICH, due to its intrinsic timing precision), and triggers are selected by comparing the timing of the other detectors with this reference input. Thus, the reference detector must always be present in all the trigger masks set by the user; to calculate its efficiency, a second detector is used as control and all the primitives issued by it become triggers. With a larger device, there is the opportunity to rewrite the firmware relaxing the hypothesis of a single reference detector, giving the chance to set different triggers without always including the same reference detector. Moreover, the new logic design will exploit the higher clock speeds supported by the new FPGA devices.
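The single-reference-detector matching described above can be sketched in a few lines. The timestamps, the matching window and the detector lists below are illustrative stand-ins, not the L0TP implementation, which operates on time-aligned primitive streams in hardware:

```python
# Sketch of the reference-detector coincidence logic described above.
# Timestamps (ns), the matching window and the detector lists are
# illustrative; the real L0TP performs this on time-aligned streams
# in FPGA logic, anchored to the reference detector's primitives.

def find_coincidences(reference, others, window_ns=5.0):
    """For each reference primitive, collect the set of other detectors
    having a primitive within +/- window_ns of its timestamp."""
    triggers = []
    for t_ref in reference:
        matched = {det for det, times in others.items()
                   if any(abs(t - t_ref) <= window_ns for t in times)}
        triggers.append((t_ref, matched))
    return triggers

# RICH as reference: every trigger candidate is anchored to a RICH hit,
# which is why the reference must appear in every trigger mask.
rich = [100.0, 250.0, 400.0]
others = {"MUV3": [101.5, 399.0], "LKr": [248.0]}

for t_ref, dets in find_coincidences(rich, others):
    print(f"t={t_ref} ns matched: {sorted(dets)}")
```

The sketch also makes the limitation explicit: a MUV3 or LKr primitive with no nearby RICH hit can never form a trigger, which is exactly the constraint the larger device will allow L0TP+ to relax.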
Finally, employing modern FPGAs, new design tools such as High-Level Synthesis (HLS) can be adopted to augment the capabilities of the system.

Selection of the Hardware Platform
For the L0TP platform upgrade we surveyed the latest generation of FPGAs on the market; the offerings from the two major vendors, Xilinx and Intel, consist of devices which are quite close in terms of architecture and performance (Intel Stratix-10 and Xilinx UltraScale+). In order to cut costs and reduce time-to-solution we focused on development kits and off-the-shelf products, with plug-in modules available to adapt the board to the experiment. We use the High Tech Global FMC 10-Port SFP+ (10G) module plugged into the FMC+ port to interface the board to the Gigabit Ethernet channels coming from the detectors. The setup was validated through extensive Bit Error Rate (BER) tests at 10 Gbps, performed with the Xilinx IBERT tool at different stages of the Ethernet channels' physical media and using different PRBS patterns. The Xilinx 10G/25G (10GbE) and 1G/2.5G (GbE) Ethernet cores and the UDP offload module inherited from the NaNet design [9] were integrated in the test-bed depicted in figure 1; it is a loopback in which a host PC feeds UDP/IP traffic into the VCU118, which sends it back to the PC where a final integrity check is performed on the UDP packet payload.

Porting the code onto the new platform, adding new functionalities and rewriting some parts is proceeding in an incremental way. We started by replacing Altera Stratix-IV IPs with Xilinx ones, which, given the strictly synchronous operation of the logic, required adapting the system to the new interfaces and latencies. Every change was tested in the Xilinx Vivado simulation environment.
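The loopback integrity check of the test-bed can be mimicked in software. In the sketch below a local UDP echo socket stands in for the VCU118, so the addresses, ports and payloads are illustrative:

```python
# Software sketch of the UDP loopback integrity test: a local echo
# socket stands in for the VCU118 board, which in the real test-bed
# receives UDP traffic from the host PC and sends it back unchanged.

import os
import socket
import threading

def echo_once(sock):
    """Echo a single UDP datagram back to its sender (board stand-in)."""
    data, addr = sock.recvfrom(2048)
    sock.sendto(data, addr)

def loopback_check(payload: bytes) -> bool:
    """Send a payload, receive the echo, verify it is bit-identical."""
    board = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    board.bind(("127.0.0.1", 0))  # ephemeral port, localhost only
    threading.Thread(target=echo_once, args=(board,), daemon=True).start()

    host = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    host.settimeout(2.0)
    host.sendto(payload, board.getsockname())
    echoed, _ = host.recvfrom(2048)
    host.close()
    board.close()
    return echoed == payload

print(loopback_check(os.urandom(1024)))
```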

Porting the L0TP Design to the Xilinx VCU118 Platform
Figure: the Xilinx VCU118 board equipped with the HTG module (top left).

Two considerations drive this incremental approach: first, the need for a first release, without any add-ons, that exactly replicates the behaviour of the L0TP and can be validated against the real experiment during the next data taking; second, the construction of a set of non-regression tests that make it possible to validate any new feature. Table 2 shows the occupancy of the ported code on the Virtex UltraScale+ platform. The small utilization of memory and logic resources gives us the chance to add useful functionalities to the core logic, which are described below.

New Functionalities Introduced with L0TP+
As discussed above, L0TP+ reproduces all L0TP functions and adds several capabilities to the original design. The block diagram in Figure 4 shows the main components of the L0TP+ logic, with the new modules that have been or are planned to be included in the design depicted in red.

Data Links
The connectivity capabilities of the Xilinx Virtex UltraScale+ XCVU9P FPGA are quite oversized with respect to the current requirements of the L0TP system. Nevertheless, their usability is limited by the actual number of FMC connectors present on the VCU118 board and by the market availability of interface cards. With the platform described above, the system is able to support eight 1/10 GbE links through the FMC+ daughtercard, while the two QSFP28 ports on the VCU118 can be used either to connect up to eight additional data links from the detectors through breakout cables, or to expand the platform capabilities by interconnecting multiple boards via 100 Gbps low-latency links.

Microcontroller
A 32-bit MicroBlaze soft-core microcontroller was integrated for debug and configuration purposes; applications can be deployed onto it either bare-metal or through Xilinx PetaLinux. Shell access is provided by the UART interface (via USB) or through an Ethernet connection. Integration of L0TP+ with the experiment run control system will occur via a socket server over the Ethernet interface. Platform configuration and status control are achieved through memory-mapped registers accessed over the AXI system bus.
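Register access over the AXI bus boils down to word-sized reads and writes at fixed offsets into a mapped window. In the sketch below a bytearray stands in for the memory-mapped region (the real system would map the device's physical address range), and all register names and offsets are hypothetical, not the actual L0TP+ register map:

```python
# Sketch of memory-mapped register access on the MicroBlaze/PetaLinux
# side. A bytearray stands in for the AXI window that the real system
# would obtain by mapping the device; all offsets are hypothetical.

import struct

REG_CONTROL = 0x00   # hypothetical control register
REG_STATUS  = 0x04   # hypothetical status register
REG_MASKS   = 0x08   # hypothetical trigger-mask enable register

axi_window = bytearray(4096)  # stand-in for the mapped AXI region

def reg_write(offset: int, value: int) -> None:
    """Write a 32-bit little-endian word at the given register offset."""
    struct.pack_into("<I", axi_window, offset, value & 0xFFFFFFFF)

def reg_read(offset: int) -> int:
    """Read a 32-bit little-endian word from the given register offset."""
    return struct.unpack_from("<I", axi_window, offset)[0]

reg_write(REG_MASKS, 0xFF)  # e.g. enable eight trigger masks
print(hex(reg_read(REG_MASKS)))
```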

Stream Processing Module
The growing interest in using FPGAs for compute acceleration has also boosted interest in High-Level Synthesis (HLS) as a path to more straightforward application development. HLS is a technique for programming logic that bypasses traditional hardware description languages such as Verilog or VHDL, directly translating C/C++ functions into logic modules instead. Waiting for the result of one function call before sending the data for the next is generally not very effective; the translation of the synthesized function into logic elements instead gives access to a possibly large number of pipelining stages, allowing the logic to overlap the processing of different sets of data at the same time.

With the outlook of processing primitive streams from additional subdetectors and thus improving the efficiency of the trigger, we designed a test case to leverage the new features enabled in L0TP+ by HLS. Being able to receive primitives from the RICH readout boards, we built a deep learning model based on three fully connected layers (with the TensorFlow software platform [10]) and trained it to classify events in four categories: no ring, 1 ring, 2 rings, more than 2 rings. The model was then implemented on the FPGA via the HLS4ML tool [11] and Vivado HLS; the resulting module can be integrated in the L0TP+ firmware as a Stream Processing Module.

In our tests we used a 64-value array for each event as input to the first fully connected layer. Every value is the PM channel number divided by the number of channels, in order to have a normalized input, and is represented as an 18-bit fixed-point number (six bits before the decimal point). We employed 80000 samples for the training of the model. A very preliminary result is a prediction accuracy of ∼ 74%, with a low ratio of used FPGA resources: 3.3% of configurable logic block LUTs, 9.7% of DSPs and 0.9% of block RAM.
The low occupancy allows for easy integration in the trigger processor core logic, at the same time leaving plenty of resources for the already foreseen improvements of the deep learning model and the implementation of other features such as a PCIe interface.
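The input encoding and the shape of the network can be sketched in NumPy. Only the 64-value input, the 4 output classes and the fixed-point format come from the text; the weights are random placeholders (not the trained model), and the hidden-layer sizes and PM channel count are illustrative:

```python
# NumPy sketch of the ring-counting classifier input path: normalized
# PM channel ids quantized to 18-bit fixed point (6 integer bits, hence
# 12 fractional bits), fed through a three-dense-layer network. Weights
# are random placeholders; hidden sizes and channel count are made up.

import numpy as np

FRAC_BITS = 12  # 18-bit word with 6 bits before the point

def quantize(x):
    """Round to the nearest representable fixed-point value."""
    return np.round(x * 2**FRAC_BITS) / 2**FRAC_BITS

def dense(x, w, b, relu=True):
    """One fully connected layer with optional ReLU."""
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

rng = np.random.default_rng(0)
n_channels = 2048  # illustrative PM channel count

# 64 hit channel ids per event, normalized to [0, 1) and quantized
hits = rng.integers(0, n_channels, size=64)
x = quantize(hits / n_channels)

# three fully connected layers, the last producing the 4 class scores
w1, b1 = rng.normal(size=(64, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, 16)), np.zeros(16)
w3, b3 = rng.normal(size=(16, 4)), np.zeros(4)

logits = dense(dense(dense(x, w1, b1), w2, b2), w3, b3, relu=False)
print("predicted class:", int(np.argmax(logits)))  # 0: no ring ... 3: >2 rings
```

HLS4ML converts a trained model of exactly this shape into an HLS project, with the `ap_fixed` precision playing the role of the quantization step shown here.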

PCIe Host Interface
We plan to integrate a PCIe core into the L0TP+ FPGA to provide a low-latency interface to the CPU and GPU on the host: the former is useful for control and debug of the board, e.g. storing primitives on the host to gather statistics on trigger behavior and performance; the latter gives the opportunity to capitalize on the NaNet project design [12] to perform GPU-based low-level trigger computing. A hardwired GPUDirect engine in the FPGA, providing low-latency/low-jitter transfers between the data links and the GPU through PCIe, will allow implementing a real-time processing pipeline in the low-level trigger of the experiment with a coordinated combination of heterogeneous computing devices (CPUs, FPGAs and GPUs).

Conclusions and Future Work
The upgrade of the Level-0 Trigger Processor is in an advanced stage of development, with the objective of being deployed for data taking during the upcoming NA62 Run 3, and also aims to gather information for further experiments at higher intensity. The work presented here shows how we devised a system capable of fully interfacing with the experiment (detectors, farm, TTC) and of adding flexibility and resources while guaranteeing the standard L0TP functionalities. The new generation of FPGA-based devices allows us to avoid technology-obsolescence problems during the experiment lifetime. Thanks to the increase in FPGA resources, we are able to overcome limitations in the firmware logic, in particular the need for a fixed reference detector. The larger connectivity provided by the new device will allow connecting more data links from the detectors or supporting higher-bandwidth links. In addition, the introduction of the HLS paradigm allows partially reconfiguring the trigger and introducing a Data Stream Processing stage that can be exploited, as in our tests, with a deep learning module designed to process data online.