A real-time FPGA-based cluster-finding algorithm for the LHCb silicon pixel detector

Starting from the next LHC run, the upgraded LHCb High Level Trigger will process events at the full LHC collision rate (averaging 30 MHz). This challenging goal, tackled using a large and heterogeneous computing farm, can be eased by addressing the lowest-level, most repetitive tasks at the earliest stages of the data acquisition chain. FPGA devices are very well suited to performing certain computations with a high degree of parallelism and efficiency, computations that would be significantly demanding if performed on general-purpose architectures. A particularly time-demanding task is the cluster-finding process, due to the 2D pixel geometry of the new LHCb pixel detector. We describe here a custom, highly parallel FPGA-based clustering algorithm and its firmware implementation. The implementation has shown excellent reconstruction quality during qualification tests, while requiring a modest amount of hardware resources. It can therefore run in real time in the LHCb FPGA readout cards during data taking at 30 MHz, representing a promising alternative to more common CPU-based algorithms.


Introduction
The LHCb detector [1,2] is a single-arm forward spectrometer, designed for precision studies of b- and c-hadrons produced in pp collisions. During Run 1 and Run 2, the LHCb detector has shown excellent performance, both in terms of data quality and of track reconstruction and particle identification efficiencies. However, one of the main limitations of the current detector is the maximum readout rate (1.1 MHz) of most sub-detectors, which constrains trigger efficiencies, particularly in hadronic channels.
To overcome these limitations, the LHCb experiment is undergoing an extensive upgrade in view of the upcoming third run of the LHC [3]. Several sub-detectors, including the silicon pixel vertex detector, have been completely redesigned to cope with a peak luminosity L = 2 × 10³³ cm⁻² s⁻¹. A software High Level Trigger (HLT) capable of processing the full inelastic collision rate of 30 MHz is being implemented, improving trigger decisions and maximizing signal efficiencies. The upgraded LHCb data acquisition framework will challenge the whole data-handling system, due to the large amount of data that has to be processed. In this respect, a common effort is being made to address heavily repetitive tasks at early DAQ stages, leaving only the more complex ones to CPUs. An example of such tasks is the clustering of active pixels in the silicon vertex detector: grouping contiguous pixels into single hits is both time-demanding, due to the 2D pixel geometry, and highly parallelizable.
We have developed, implemented and characterized a clustering algorithm that can run on back-end FPGA-based DAQ cards during the detector readout [4,5]. The features of this algorithm are based on a design developed within the INFN-RETINA R&D project [6].

Clustering in LHCb pixel detector
The structure of the clustering algorithm is applicable to a generic pixel detector, but it has specific features tailored to the LHCb Vertex Locator (VELO) [7]. The VELO detects charged particles in the region closest to the interaction point, aiming at reconstructing primary and secondary vertices with a spatial resolution smaller than the typical decay lengths of b- and c-hadrons in LHCb (cτ ∼ 0.01–1 cm), in order to discriminate between them.
The new VELO, based on silicon pixel technology, will consist of 52 modules positioned along the beam axis, both upstream and downstream of the nominal interaction point. Fig. 1 shows the sub-structure of a VELO layer: a module consists of four sensors, with three chips each. A particle crossing a VELO module usually activates more than one pixel. VELO data are formatted as 4×2 pixel blocks, named SuperPixels (SPs). SPs are sorted into two categories, according to the presence of active neighboring SPs: an SP is flagged as 'isolated' if none of its eight SP neighbors has any active pixel. This information helps in optimizing the performance of the cluster reconstruction process that follows, allowing a different, faster algorithm for isolated SPs.
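As an illustration, the following minimal Python sketch performs the isolation flagging, assuming an illustrative data layout (SPs indexed by integer column and row coordinates) rather than the actual VELO readout format.

    def flag_isolated(active_sps):
        """active_sps: set of (sp_col, sp_row) coordinates of SPs that
        contain at least one active pixel. Returns a dict mapping each
        active SP to its isolation flag."""
        flags = {}
        for (c, r) in active_sps:
            # an SP is 'isolated' if none of its eight SP neighbors is active
            neighbors = ((c + dc, r + dr)
                         for dc in (-1, 0, 1) for dr in (-1, 0, 1)
                         if (dc, dr) != (0, 0))
            flags[(c, r)] = not any(n in active_sps for n in neighbors)
        return flags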

An FPGA-friendly clustering algorithm
Clusters produced by particles hitting the VELO detector typically consist of just a few pixels (1–4 pixels in 96% of cases), as shown in Fig. 2. For this reason, a significant fraction of the clusters are isolated, making it convenient to reconstruct them separately with a lookup table (LUT). The LUT is loaded with pre-calculated addresses, linking each of the 256 SP configurations to the corresponding cluster coordinates (a minimal software sketch is shown below). In this way, reconstructing clusters contained in a single SP requires a very small amount of FPGA resources and is very fast.

Finding clusters from non-isolated SPs requires a more structured approach, involving multiple steps. For each event, all SPs coming from the same VELO sensor fill a set of matrices, as shown in Fig. 3. Each matrix can contain up to 9 SPs, in three rows and three columns, and does not map to a specific VELO region until it is initialized. When an SP arrives at an uninitialized matrix, it fills the center of the matrix, and the coordinates of the neighboring SPs are calculated. Further SPs input to the matrix are compared with the previously calculated coordinates: in case of a match, the pixel status is used to fill the corresponding position in the matrix; otherwise the SPs are passed on to the next matrix in the chain. At the end of each event, in a fully parallel way, each pixel checks whether it belongs to one of the patterns shown in Fig. 4.

Figure 4: Pixel patterns seeding a cluster candidate. Patterns are optimized for the sensor mounting orientation. See [5] for further details.
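As an illustration of the isolated-SP case, the following Python sketch builds a 256-entry LUT mapping each 8-bit SP pixel pattern to cluster coordinates. For simplicity, each pattern is assumed here to hold a single cluster and is mapped to the centroid of its active pixels; the actual LUT contents are pre-calculated with more refined criteria.

    def build_isolated_sp_lut():
        """Map each of the 256 possible 4x2 SuperPixel patterns to
        pre-computed cluster coordinates (here: the centroid of the
        active pixels, assuming one cluster per pattern)."""
        lut = {}
        for pattern in range(1, 256):
            # illustrative bit layout: bit i -> pixel (col = i % 4, row = i // 4)
            pixels = [(i % 4, i // 4) for i in range(8) if (pattern >> i) & 1]
            n = len(pixels)
            lut[pattern] = (sum(c for c, _ in pixels) / n,
                            sum(r for _, r in pixels) / n)
        return lut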
For non-isolated SPs, the corresponding 3 × 3 cluster candidate is likewise resolved by a LUT. The absolute cluster position is then obtained as the vector sum of the matrix position with respect to the detector, the checking-pixel position with respect to the matrix, and the cluster position with respect to the checking pixel (see the sketch below). The algorithm has three main parameters that can be optimized: the matrix shape and size are determined by how SPs with neighbors are arranged together; the distribution of the number of SPs with neighbors per event establishes the number of matrices that have to be instantiated; and the size of the cluster candidates is determined by the distribution of cluster sizes shown in Fig. 2. For the VELO clustering algorithm it has been decided to implement 20 matrices for each VELO sensor.
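The vector sum can be sketched as follows (all names are illustrative):

    def absolute_cluster_position(matrix_origin, checking_pixel, cluster_offset):
        """matrix_origin: matrix position with respect to the detector;
        checking_pixel: checking-pixel position with respect to the matrix;
        cluster_offset: cluster position with respect to the checking pixel,
        as resolved by the 3x3 cluster-candidate LUT."""
        return tuple(m + p + o for m, p, o
                     in zip(matrix_origin, checking_pixel, cluster_offset))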

Reconstruction quality
In the FPGA implementation of the clustering algorithm, cluster candidates are limited to a 3 × 3 pixel mask. For larger clusters, only a subset of pixels is used in determining the cluster position. Although such clusters are uncommon, the clustering and tracking reconstruction quality has been studied to ensure that they are not degraded when FPGA clusters are used. For this purpose, a bit-level simulation of the FPGA clustering algorithm has been implemented and integrated in the official LHCb simulation environment. The HLT tracking is fed with FPGA clusters, and its output is compared with that obtained with the standard CPU-based clustering code. The CPU-FPGA comparison has been performed on a sample of 50k minimum-bias simulated events, at a center-of-mass energy √s = 14 TeV and luminosity L = 2 × 10³³ cm⁻² s⁻¹ (Run 3 upgrade conditions).

Cluster reconstruction efficiency is defined as the ratio of the number of detector hits matched to reconstructed clusters to the number of reconstructible hits. A hit is called reconstructible if the particle generating it has left enough charge in the detector to light up at least one pixel. The overall FPGA cluster inefficiency is below 0.1% within the LHCb geometrical acceptance (2 < η < 5).

The quality of the reconstructed clusters is studied using cluster residuals, defined as the distance between the cluster center and the position of the associated particle within the detector (see the sketch below). Fig. 5 (right) shows a comparison between the CPU and FPGA cluster residual distributions. Differences at the per-mille level are observed between the CPU and FPGA clustering algorithms for VELO and long track types. These differences have been studied as a function of several kinematic variables. Fig. 6 shows the VELO tracking efficiency for long non-electron tracks, matched to a true simulated particle, as a function of the particle momentum, using CPU and FPGA clusters, with a magnified vertical scale to highlight the differences between the algorithms. No significant difference is observed.

Figure 6: VELO tracking efficiency for long non-electron tracks, matched to a true simulated particle, as a function of the particle momentum, comparing the CPU and FPGA clustering algorithms. The blue histogram shows the momentum distribution of the particles. The vertical scale is magnified to highlight the differences between the algorithms. Data are 50k minimum-bias simulated events.
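The two figures of merit can be sketched as follows, assuming matched (cluster, true hit) pairs are available from the simulation; the actual LHCb analysis code is, of course, more involved.

    def cluster_efficiency(n_found, n_reconstructible):
        # fraction of reconstructible hits matched to a reconstructed cluster
        return n_found / n_reconstructible

    def cluster_residuals(matched_pairs):
        # per-coordinate distance between the cluster center and the true
        # particle position, for each (cluster, truth) pair of 2D points
        return [(cx - tx, cy - ty) for (cx, cy), (tx, ty) in matched_pairs]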

Firmware implementation and hardware testing
The FPGA clustering firmware, available in the public code repository [9], is written in VHDL, in order to fully exploit the FPGA potential in terms of parallelization, timing, and resource usage. The firmware has a modular structure, where each unit serves a precise purpose [5]. Fig. 7 shows the input-output interfaces, the main components, and their connections, starting from the input side on the left. The clustering firmware is modest [11] in terms of the amount of logic and memory it uses: it requires roughly 26% of the logic and 10% of the memory of an Intel® Arria® 10 chip to process an entire VELO module.
In order to run clustering as a real-time process, the firmware has to sustain a 30 MHz event processing rate, matching the LHC average bunch-crossing rate. The system runs comfortably without errors at a clock frequency of 350 MHz (out of a 650 MHz nominal maximum for our chip model), providing a measured event rate of 38.9 MHz, as shown in Fig. 8, amply sufficient to sustain the 30 MHz readout target. The firmware, completed with all the necessary ancillary logic, has been integrated in the VELO readout firmware as a self-contained block at the end of the processing chain; its output is transmitted out of the readout card via the PCIe interface.
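As a back-of-the-envelope check, the measured figures imply that the firmware processes an event in about nine clock cycles on average:

    # at a 350 MHz clock, a measured event rate of 38.9 MHz corresponds to
    # 350 / 38.9 ≈ 9 clock cycles per event on average
    clock_mhz = 350.0
    event_rate_mhz = 38.9
    print(clock_mhz / event_rate_mhz)  # ≈ 9.0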
A total of 52 Intel® Arria® 10 boards are needed to reconstruct the entire VELO, one board for each module. Although clustering data from a single VELO module does not require all the FPGA resources available in an Intel® Arria® 10 chip, other operations need to be performed beforehand. These involve the time-alignment of SuperPixels and the SuperPixel flagging tasks [12], which add to the total amount of resources needed. The resources needed for the entire firmware, from receiving SuperPixels from the detector to cluster reconstruction, fit within the FPGA limits, so no extra hardware is needed.

Throughput and bandwidth gains
VELO tracking, including cluster reconstruction, is the most time-consuming task of the first stage of the high level trigger (HLT1), taking about 48% of the HLT1 processing time [13]. Running the HLT1 reconstruction on CPUs with and without the FPGA clustering algorithm shows a gain in event-rate throughput of about 8%. LHCb has recently decided to run the full HLT1 reconstruction on a GPU-based architecture starting from the imminent LHC Run 3 [14]; the GPU-based HLT1 throughput increases by about 4% when VELO clustering is offloaded to FPGAs. Furthermore, running clustering at early DAQ stages reduces the VELO detector bandwidth [15]. To quantify the reduction, the average number of SPs per event is compared to the corresponding number of reconstructed clusters, leading to a data size reduction of around 15% (see the sketch below).
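A minimal sketch of the bandwidth estimate, under the simplifying assumption that an SP word and a cluster word occupy comparable amounts of data:

    def bandwidth_reduction(avg_sps_per_event, avg_clusters_per_event):
        # fractional data-size reduction from shipping clusters instead of
        # SPs; the measured ratio yields a reduction of around 15%
        return 1.0 - avg_clusters_per_event / avg_sps_per_event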