Heterogeneous techniques for rescaling energy deposits in the CMS Phase-2 endcap calorimeter

We present the porting to heterogeneous architectures of the algorithm used to apply linear transformations to raw energy deposits in the CMS High Granularity Calorimeter (HGCAL). This is the first heterogeneous algorithm to be fully integrated with HGCAL’s reconstruction chain. After introducing the latter and giving a brief description of the structural components of HGCAL relevant for this work, the role of the linear transformations in the calibration is reviewed. The various ways in which parallelization is achieved are described, and the successful validation of the heterogeneous algorithm is covered. Detailed performance measurements are presented, including throughput and execution time for both the CPU and GPU algorithms, thereby establishing the corresponding speedup. We finally discuss the interplay between this work and the porting of other algorithms in the existing reconstruction chain, as well as the integration of algorithms previously ported but not yet integrated.


Introduction
The operation of the High Luminosity LHC (HL-LHC) is expected to commence in 2027, achieving instantaneous luminosities of ∼5 × 10³⁴ cm⁻² s⁻¹, more than two times the LHC's current value. Over 10 years it will reach an integrated luminosity of ∼3 ab⁻¹, with potentially up to 200 proton collisions (pileup) per bunch crossing. The goals of the HL-LHC include measuring the Higgs boson (self-)couplings, vector boson fusion and vector boson scattering processes (also involving the Higgs boson), and B physics processes, among others [1].
In accordance with this programme, the upgrade of the CMS detector [2] foresees a High Granularity Calorimeter (HGCAL) [3] to replace the current endcap calorimeters. One of the challenges posed to CMS by the new calorimeter is writing reconstruction code allowing its full exploitation.
Present projections show a gap between projected CPU needs and availability at the start of the HL-LHC (Run 4), as displayed in Fig. 1. The biggest contributor to CPU usage is event reconstruction (see Fig. 2), of which currently ∼6% is used by HGCAL. CMS plans to port some parts of its reconstruction to Graphics Processing Units (GPUs), which represent one of the most promising accelerator technologies on the market. Its adoption would allow access to accelerators, which become more and more present on High-Performance Computing and traditional grid sites. It would also be in line with the direction taken by CMS to adopt a heterogeneous HLT farm already in Run 3. Finally, it potentially reduces the cost of the computing capacity necessary to satisfy the CMS physics programme, since computation on GPUs might be cheaper than on CPUs.
(Figure caption) "Gen", "Sim", "DIGI" and "RECO" refer to the tasks employed by CMS to process real and simulated data (in order: generation, simulation, digitization and reconstruction). "reMINIAOD" refers instead to the production of the MINIAOD CMS data format for physics analysis with different CMSSW versions. AOD (Analysis Object Data) and more compact data subsets such as MINIAOD derive from the RECO format. Figure taken from [4].
In this paper we describe the first heterogeneous implementation, fully compatible with the CMS software (CMSSW) [5,6], of one of HGCAL's reconstruction stages, namely the rescaling of energy deposits (also loosely called "calibration"). The implementation relies on Nvidia's CUDA [7]. This effort follows the GPU porting of HGCAL's clustering algorithm (CLUstering of Energy, CLUE) [8], which for the moment is implemented as a standalone version. The paper is structured as follows. We start by briefly describing HGCAL in Section 2. We then cover the basic functionalities of HGCAL's reconstruction chain and describe the methodology of the application of linear transformations to energy deposits, respectively in Sections 3 and 4. Finally, we assess the calibration's validity and performance in Section 5, and draw conclusions in Section 6.

The High Granularity Calorimeter
The HGCAL will be a sampling calorimeter. The proposed design includes, as active material, silicon (Si) sensors in the front 28 layers of its electromagnetic section (CE-E). The hadronic section (CE-H) comprises 8 Si-only layers followed by 14 Si-scintillator hybrid layers, where the scintillation light is read out by silicon photomultipliers (see Fig. 3). The Si sensors are further subdivided into three types with varying thicknesses (120, 200 and 300 µm), capacitances and sizes, to withstand different fluence conditions. The absorbers will be made of CuW, Pb and Cu in the CE-E and of stainless steel and Cu in the CE-H, and their thicknesses will vary across layers. The electromagnetic radiation and hadronic interaction lengths of the CE-E are 25 X₀ and 1.3 λ respectively, while the hadronic interaction length of the CE-H is 8.2 λ. In total, the full HGCAL system has ∼620 m² of Si and ∼400 m² of plastic scintillators. The size of each Si sensor cell is 0.5 cm² to 1.0 cm² (120 µm Si sensors have smaller cells). Scintillator tiles will range in size from 4 to 30 cm², and the number of Si (scintillator) channels is ∼6 million (∼240 thousand). Each endcap weighs ∼215 t and measures ∼2 m (∼2.3 m) in the longitudinal (radial) direction. The full system operates at a temperature of −35 °C, maintained by a CO₂ cooling system [3].
Due to HGCAL's hybrid detection technology, three subdetectors are considered independently for both the CPU and GPU implementations of the reconstruction algorithms:
• CE-E: comprises the first 28 layers, made exclusively of Si;
• CE-H Si: covers the Si part of the CE-H section;
• CE-H Sci: covers the scintillator part of the CE-H section.
This partition is relevant to the GPU calibration algorithm described in Section 4.

Reconstruction Chain
The current reconstruction model envisioned for HGCAL, part of CMSSW and succinctly depicted in Fig. 4, is intended to be fast and flexible. It comprises a series of modules which transform raw data into physics objects. After the first stages described in [4], one obtains UncalibRecHits. These represent energy deposits whose amplitude is expressed in terms of the average number of minimum ionizing particles (MIPs), after being converted from analog-to-digital converter (ADC) counts by the previous Digi step and taking the sensor thickness into account. This paper covers the following step, i.e., rescaling the hits to produce a CMSSW collection of RecHits (see Section 4). Continuing along the chain, the software then clusters the RecHits into two-dimensional layer clusters using CLUE [8]. Finally, taking the clusters as its input, The Iterative CLustering (TICL) framework [9] produces 3D objects and showers using a mixture of pattern recognition, energy regression and particle identification techniques. In parallel, a heterogeneous way of navigating through geometrical and topological information within the detector (such as information regarding Si sensors or plastic tiles) is being investigated, in order to accelerate and facilitate its access by the different algorithms in the chain. The constant need to retrieve the x and y coordinates (in HGCAL's transversal plane) in CLUE is an example of these navigation challenges.
Figure 4. HGCAL's reconstruction chain, from raw data to physics objects. The geometry and topology of the detector are intertwined with all modules. The whole chain was already developed for CPU-only algorithms. We highlight in red the stages currently under development for heterogeneous computing, specifically using CUDA. The first three stages are yet to be completely settled upon. This work covers the top right module of the diagram.

Methodology
The energy calibration step consists of converting UncalibRecHits, measured in multiples of MIPs, into energy measurements in multiples of electron volts. In order to measure the energy of a particle shower in a sampling calorimeter with varying absorber thickness, the energy deposited in the sensors (RecHits) must be weighted on a per-layer basis. The energy weights, with units of MeV/MIP, are calculated with the help of Geant4 [10]. Each UncalibRecHit is thus rescaled by one of these weights, depending on the layer in which it is located, and is further corrected with so-called "regional electromagnetic factors", calculated by comparing the incident energy of a simulated photon with the shower energy measured using rescaled hits. Currently seven of these factors are defined: six for the three Si thicknesses in both CE-E and CE-H, and one more for the scintillator in CE-H. Hits in the CE-E are additionally corrected by thickness-dependent charge collection efficiencies [11].
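For concreteness, the rescaling just described can be summarized as a single multiplicative expression per hit. The sketch below uses our own symbol names, and the placement of the charge-collection-efficiency correction as a division is an assumption about the convention rather than the exact CMSSW expression:

```latex
% Illustrative summary of the per-hit rescaling (symbol names are ours):
%   A_hit     : UncalibRecHit amplitude, in MIP units
%   w_layer   : per-layer energy weight, in MeV/MIP (derived with Geant4)
%   f_region  : one of the seven regional electromagnetic factors
%   CCE_thick : thickness-dependent charge collection efficiency (CE-E hits)
E_{\mathrm{hit}}\,[\mathrm{MeV}] \simeq
  A_{\mathrm{hit}}\,[\mathrm{MIP}] \times
  w_{\mathrm{layer}}\,[\mathrm{MeV/MIP}] \times
  \frac{f_{\mathrm{region}}}{\mathrm{CCE}_{\mathrm{thick}}}
```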
The code implementing this type of hit-dependent corrections can be straightforwardly converted to a heterogeneous version by assigning one CUDA thread to each hit. There is no need for hits to share information with each other, allowing full hit-level parallelization. Attention is given to systematically allocating memory in multiples of CUDA's warpSize (currently 32 threads), and to accessing it such that coalesced reads are possible. This is known to increase throughput by potentially an order of magnitude [12].
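A minimal CUDA sketch of this one-thread-per-hit pattern is shown below. The kernel name, the SoA field names and the constant arrays are our own placeholders rather than the actual CMSSW code; the intent is only to illustrate why the rescaling is embarrassingly parallel.

```cuda
// Illustrative kernel: one thread per hit, with a grid-stride loop.
// The SoA pointers (amplitudeMIP, layer, thicknessIdx, energyMeV) are
// hypothetical stand-ins for the real HGCAL SoA fields.
__global__ void rescaleHitsKernel(const float* __restrict__ amplitudeMIP,
                                  const int*   __restrict__ layer,
                                  const int*   __restrict__ thicknessIdx,
                                  float*       __restrict__ energyMeV,
                                  const float* __restrict__ weightPerLayer,   // MeV/MIP, per layer
                                  const float* __restrict__ regionalEMFactor, // per thickness category
                                  const float* __restrict__ cceInverse,       // 1/CCE per thickness (1.0 outside CE-E)
                                  int nHits) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < nHits;
       i += blockDim.x * gridDim.x) {
    const int t = thicknessIdx[i];
    // All corrections are per-hit lookups: no inter-thread communication is needed.
    energyMeV[i] = amplitudeMIP[i] * weightPerLayer[layer[i]] *
                   regionalEMFactor[t] * cceInverse[t];
  }
}
```

Launching with block sizes that are multiples of warpSize, and mapping consecutive hits to consecutive threads, keeps the reads of each SoA array coalesced.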
Further parallelization is made possible due to HGCAL's topology, namely the three subdetectors presented in Section 2. They are considered independently of each other, both in the CPU and GPU implementations, and differ solely in the considered layers and regional correction factors. When using CUDA, this split allows the exploitation of independent CUDA streams, one per subdetector, as shown schematically in Fig. 5. The execution of actions (kernel execution, copies, and transfers, for instance) is guaranteed to be serialized within a CUDA stream. The same streams can also be used for other modules which do not require sharing of information at subdetector-level, such as the ones which produce UncalibRecHits.
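The stream-per-subdetector scheme of Fig. 5 can be sketched as follows. The buffer layout and function name are hypothetical; the sketch only shows how the three independent copy-kernel-copy sequences can be issued on separate streams so that they may overlap on the device.

```cuda
// Illustrative only: three CUDA streams, one per HGCAL subdetector
// (CE-E, CE-H Si, CE-H Sci). Work within a stream is serialized;
// work on different streams may overlap.
#include <cuda_runtime.h>

constexpr int kNSubdetectors = 3;  // CE-E, CE-H Si, CE-H Sci

// h_in/d_in and h_out/d_out are hypothetical pinned-host and device SoA
// buffers, one per subdetector, each holding nHits[s] hits.
void calibrateAllSubdetectors(float* h_in[], float* d_in[],
                              float* h_out[], float* d_out[],
                              const int nHits[]) {
  cudaStream_t streams[kNSubdetectors];
  for (int s = 0; s < kNSubdetectors; ++s) cudaStreamCreate(&streams[s]);

  for (int s = 0; s < kNSubdetectors; ++s) {
    const size_t bytes = nHits[s] * sizeof(float);
    cudaMemcpyAsync(d_in[s], h_in[s], bytes, cudaMemcpyHostToDevice, streams[s]);
    // rescaleHitsKernel<<<(nHits[s] + 255) / 256, 256, 0, streams[s]>>>(/* ... */);
    cudaMemcpyAsync(h_out[s], d_out[s], bytes, cudaMemcpyDeviceToHost, streams[s]);
  }

  for (int s = 0; s < kNSubdetectors; ++s) {
    cudaStreamSynchronize(streams[s]);
    cudaStreamDestroy(streams[s]);
  }
}
```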
The calibration module is developed as an integral part of CMSSW, and takes advantage of some of its CUDA-related functionalities [13,14]. From a computational perspective, as shown in Fig. 6, the calibration starts by converting UncalibRecHits into a data structure more suitable for GPUs, namely a Structure of Arrays (SoA). The latter improves access to global memory in the GPU and allows CPU vectorization to be exploited [15]. The SoAs are then transferred to the GPU, together with some configuration constants, where the calibration is performed. The data are then directly available on the GPU to be used by the next heterogeneous module, avoiding needless data transfers, which are a common GPU bottleneck [12]. To increase the framework's flexibility and to make its validation possible, the data can also be transferred back to the CPU. In total, each subdetector will have its calibration performed by a maximum of 4 modules (12 in total), event by event:
1. Conversion of the CMSSW input data collection (UncalibRecHits) into a SoA, followed by its transfer to the GPU;
2. Execution of CUDA kernels in the GPU;
3. Transfer of the calibrated SoA back to the CPU (optional);
4. Conversion of the SoA to the RecHits CMSSW collection (optional).
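To illustrate the kind of data structure involved, the sketch below shows a possible hit SoA; the field names and the padding helper are our own and do not correspond to the CMSSW data format. Because each field is a contiguous array, thread i reading field[i] yields coalesced global-memory accesses on the GPU, and the same layout vectorizes well on the CPU.

```cpp
// Hypothetical hit SoA for the rescaling step (not the CMSSW format).
#include <cstdint>

struct HGCRecHitSoA {
  float*    amplitudeMIP;  // input amplitude, in MIP units
  float*    energyMeV;     // output rescaled energy
  uint32_t* detid;         // detector id of the cell
  int32_t*  layer;         // layer index, used to pick the MeV/MIP weight
  int32_t*  thicknessIdx;  // Si thickness (or scintillator) category
  int32_t   nHits;         // number of valid hits
  int32_t   stride;        // nHits rounded up to a multiple of the warp size
};

// Round a per-field length up to a multiple of the warp size (32), as done in
// the text to keep allocations and accesses warp-aligned.
constexpr int32_t padToWarp(int32_t n, int32_t warpSize = 32) {
  return ((n + warpSize - 1) / warpSize) * warpSize;
}
```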
The final goal of the heterogeneous efforts will be to perform the whole HGCAL reconstruction directly within the GPU (Step 2 only), where each module receives GPU data from the previous heterogeneous module, and feeds to the subsequent heterogeneous module the new data it has processed. Ideally, memory transfers to/from the GPU will only be performed at the start and end of the reconstruction chain. As an example, the linking between the calibration and clustering in the GPU is currently under development.
For completeness, we further mention that data-oriented structures (such as SoAs) are being considered to replace the current CMSSW object-oriented legacy format. This would allow the acceleration of CPU code and the removal of the fourth step during GPU-to-CPU transfers in heterogeneous modules.
Figure 5. The rescaling of UncalibRecHits is done independently for each of the three subdetectors in HGCAL. Each of them gets assigned a different CUDA stream. The streams will be propagated and merged in the clustering stage to allow the creation of 3D objects out of 2D clusters. The box titled "Early stages" refers to all modules which precede the UncalibRecHits rescaling.
Figure 6. Diagram showcasing the operations involved in rescaling UncalibRecHits using GPUs, implemented using four independent modules per subdetector. Both the "host-to-device" and "device-to-host" transfers (where device refers to the GPU and host to the CPU) are optional, and can be replaced by a fully heterogeneous reconstruction chain.

Validation and Performance
Version 12_0_0_pre1 of the CMSSW software [5] is used for both validation and performance measurements.
The validation procedure is straightforward: the CPU-only and GPU calibration modules are run over the same input dataset, starting from three previously produced UncalibRecHits collections, one per subdetector. The obtained RecHits are then compared in terms of their underlying quantities, such as the rescaled energy. The difference is always found to be virtually zero for the three subdetectors (∼10⁻¹⁰ for CE-E and smaller for CE-H Si and CE-H Sci at 200 pileup). This is true for 0, 50, 100, 140 and 200 pileup. The same pileup values are also considered for performance measurements.
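The comparison itself amounts to checking the per-hit differences between the two outputs. A minimal sketch is shown below; it assumes the rescaled energies have already been extracted from the matched CPU and GPU RecHit collections into plain vectors, which is not how the CMSSW validation is organized but captures the quantity being checked.

```cpp
// Hypothetical helper: largest relative energy difference between the CPU
// and GPU RecHits of one subdetector, matched by index.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

double maxRelativeDifference(const std::vector<float>& cpuEnergy,
                             const std::vector<float>& gpuEnergy) {
  double maxRel = 0.;
  for (std::size_t i = 0; i < cpuEnergy.size(); ++i) {
    const double denom = std::abs(cpuEnergy[i]) > 0. ? std::abs(cpuEnergy[i]) : 1.;
    maxRel = std::max(maxRel, std::abs(cpuEnergy[i] - gpuEnergy[i]) / denom);
  }
  return maxRel;  // found to be ~1e-10 or smaller for all subdetectors
}
```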
Performance is assessed using an Nvidia T4 GPU and an Intel(R) Xeon(R) Silver 4114 CPU (40 logical cores), together with a benchmarking tool [16] by Patatrack [17]. It measures throughput, defined as the number of events processed per second. We perform all throughput measurements using 4 jobs with 10 CPU threads each and 1100 events, skipping the first 100 to mitigate GPU initialization costs. The events are stored in the CPU cache to avoid the observed negative I/O impact on the measured throughput. Fig. 7 shows the measurements for the CPU and two different GPU scenarios (left) and the obtained speedups (right), defined as the throughput of a GPU scenario divided by the CPU throughput. We stress that the comparison is being made between a full CMS node and a single GPU. We further observed an average GPU memory usage of ∼50 MiB per CUDA stream.
Figure 7. Left: Throughput measurements of the rescaling of energy deposits as a function of pileup, considering 4 jobs with 10 CPU threads each. The blue, orange and green curves represent the throughput of, respectively, everything up to GPU execution (steps 1 and 2 from Section 4), the full GPU chain and the CPU rescaling algorithm. 'H' and 'D' stand for, respectively, host and device. Vertical error bars as given by [16] are present but are too small to be seen. Right: Speedup obtained after dividing the blue and orange distributions on the left by the CPU throughput (green). The comparison is made between a full node and a single GPU.
It is currently not possible to measure the throughput of the UncalibRecHits rescaling alone, since at least the SoA host-to-device transfer must be included in the measurement. However, most conversions and transfers will become optional once the full reconstruction chain has been deployed on GPUs. In order to estimate a conversion- and transfer-free speedup, we use Nvidia's profiler nvprof [18] to extract the average kernel execution time per event. The CPU rescaling algorithm is instead timed with the C++ std::chrono library. Both CPU and GPU time measurements are performed with 1 job and 1 CPU thread. To maximize the occupancy of the GPU, and thus expose its benefits over the serial CPU version, we consider the CE-E section, where ∼500 K hits are expected per event, while the CE-H Si and CE-H Sci sections have, respectively, ∼50 K and ∼1 K. The speedup is here defined as the CPU average time required to process an event, divided by the average time required by the GPU for the same task. Fig. 8 shows a speedup of around three orders of magnitude. By measuring the average time for single events, any form of module concurrency is necessarily ignored in the final measurements. We therefore expect the heterogeneous rescaling algorithm to perform better in a real-world scenario (in terms of throughput), where it will have access to multiple CPU jobs and threads, and multiple CUDA streams.
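The CPU-side timing can be sketched as follows; rescaleEvent() is a placeholder for the CPU calibration of one event, so the snippet only illustrates how std::chrono is used to average the per-event time, while the corresponding GPU kernel time is taken from nvprof.

```cpp
// Sketch of the std::chrono timing of the CPU rescaling algorithm.
#include <chrono>

double averageEventTimeMs(int nEvents) {
  using clock = std::chrono::steady_clock;
  const auto start = clock::now();
  for (int i = 0; i < nEvents; ++i) {
    // rescaleEvent(i);  // placeholder: CPU calibration of event i
  }
  const auto stop = clock::now();
  const std::chrono::duration<double, std::milli> elapsed = stop - start;
  return elapsed.count() / nEvents;  // average time per event, in ms
}
```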
If the number of RecHits is low (as in the case of the CE-H Sci section), GPU-related efforts can potentially be counter-productive, since memory allocations and transfers might actually be slower than the CPU algorithm. We test this by considering the CE-H Sci-related GPU modules only, and by adapting the established CPU calibration to run solely on CE-H Sci hits. We do not observe any significant impact on the throughput. Even if such an effect were present, it would concern the optimization of negligible time intervals, where very little data is being processed.
Additionally, we again note that the final chain will include multiple modules accessing data directly in the GPU; memory transfers and allocations will therefore not be requested on a per-module basis, significantly reducing their impact on the overall throughput. Moreover, as mentioned in Section 4, we are currently investigating SoA-like formats for data residing on the CPU to decrease or eliminate the time required for conversions between SoA and legacy formats.

Conclusions
This work presents the first HGCAL heterogeneous algorithm to be fully integrated within CMSSW, and the second to be developed (CLUE is for the moment available as a standalone heterogeneous algorithm). It describes the algorithm's implementation, validation and performance measurements. Its motivation is clear: CMS will have to cope with the very significant radiation and pileup increases expected for the HL-LHC. These challenges might be tackled by the widespread adoption of accelerator technologies, GPUs in particular.
Performance measurements indicate that the GPU provides a throughput up to twice as large as the CPU one, including data conversions and transfers which will not be present in the final version of the reconstruction chain. Timing measurements of the rescaling algorithm alone suggest a speedup of three orders of magnitude. Both estimates might be conservative, given that a better result is foreseen as soon as a full GPU reconstruction chain has been developed and is ready for production, taking advantage of multiple CPU jobs, streams, threads and subdetector parallelism. The rescaling, given its speed and relative simplicity, will surely not represent the bottleneck of the chain.
In addition to the speedup, the implementation of the rescaling algorithm paves the way for future accelerator developments in HGCAL. Being essential for allowing the entire reconstruction to be ported to GPUs, it provides a strong motivation for the development of the CMSSW heterogeneous rescaling-clustering linking and grants the opportunity to assess CLUE's heterogeneous performance. It further facilitates the deployment of the calibration to multiple accelerators using abstraction libraries such as alpaka [19]. Finally, GPU computing has the added benefit of being potentially cheaper than its CPU counterpart.