Jet Single Shot Detection

We apply object detection techniques based on Convolutional Neural Networks to jet reconstruction and identification at the CERN Large Hadron Collider. In particular, we focus on CaloJet reconstruction, representing each event as an image composed of calorimeter cells and using a Single Shot Detection network, called Jet-SSD. The model performs simultaneous localization and classification, plus additional regression tasks to measure jet features. We investigate Ternary Weight Networks with weights constrained to {-1, 0, 1} times a layer- and channel-dependent scaling factor. We show that the quantized version of the network closely matches the performance of its full-precision equivalent.


Introduction
The majority of particles produced at the CERN Large Hadron Collider (LHC) are unstable and immediately decay into other particles. When quarks and gluons are produced, QCD confinement prevents them from travelling across the detector. Instead, they shower into other quarks and gluons, which eventually hadronize into particles. The result of this process is a jet, a collimated shower of particles with adjacent trajectories. Jets are key in many physics analyses done on the data collected by the LHC experiments, e.g. [1][2][3][4]. Classifying the origin of these jets, i.e. the nature of the particle that initiated the shower, is known as jet tagging [5][6][7][8] and is a fundamental task for collision reconstruction at the LHC. Similarly, it is important to determine the jet energy, momentum, and mass.
The goal of this paper is to extend this approach to the problem of jet clustering, e.g., to replace FastJet [32] on computing architectures where parallel computing is more adequate. At the same time, we aim at demonstrating that jet clustering, mass measurement, and tagging could all be handled simultaneously. Besides the practical advantages of a single-shot approach to jet reconstruction, one would benefit from mutual learning when accomplishing more tasks at once. For instance, a classifier and a regression running at once can learn that calibration constants depend on the nature of the jet, an issue that is not handled with ad-hoc post-processing (see [33] as an example).
With the luminosity increase expected in the future, traditional reconstruction algorithms might suffer from execution time scaling worse than linearly with the number of collisions happening in one bunch crossing. For this reason, it is worth investigating solutions that could execute many tasks at once, while retaining accuracy and benefiting from the additional speed-up offered by parallel computing architectures. Deep neural networks, such as those used for computer vision tasks, are an obvious candidate.
On the other hand, memory consumption is also an important aspect to keep under control. To this purpose, we investigate the use of extreme quantization, up to ternary precision, which is applied already at training time to retain accuracy.
The remainder of this paper is structured as follows. In Sections 2 and 3 we briefly review single-shot detection and efficient model design techniques. In Section 4 we introduce the dataset and in Section 5 the model architecture, implementation details and training procedure. Finally, in Section 6 we present the evaluation metrics and results.

Single-shot object detection
Object detection is a fundamental task in computer vision. It is defined as the classification of objects from predefined categories in an image, together with the determination of their spatial locations. The spatial location and extent of an object can be defined coarsely using a bounding box, which is an axis-aligned rectangle tightly bounding the object. A precise pixel-wise mask, by contrast, is the output of the segmentation task.
Starting from Overfeat Network [34], the field of object detection focused on using primarily CNNs as a building block, achieving state-of-the-art results in tasks such as face [35] or pedestrian detection [36]. For a general survey on this subject, see [37,38].
The Single Shot MultiBox Detector (SSD) [49], shown in Figure 1, is a simple one-stage, anchor-based detector. First, a set of default regions with fixed shapes and sizes, called anchors, is predefined in the image to discretize the output space of bounding boxes. These anchors have a diverse set of shapes to detect objects with different dimensions, i.e. multiple scales and aspect ratios. The same number of anchors is defined at each location. Based on the ground truth, the object locations are matched with the most appropriate anchors to obtain the supervision signal for the anchor estimation.
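The anchor construction described above can be sketched as follows. This is a minimal illustration, not the Jet-SSD configuration: the function name make_anchors, the feature-map size, the scales and the aspect ratios are all hypothetical values chosen for the example.

```python
from itertools import product
from math import sqrt


def make_anchors(fmap_h, fmap_w, scales, aspect_ratios):
    """Return anchors as (cx, cy, w, h) in normalized image coordinates.

    Every feature-map location receives the same number of anchors:
    one per (scale, aspect ratio) pair.
    """
    anchors = []
    for row, col in product(range(fmap_h), range(fmap_w)):
        # Anchor centre sits in the middle of the feature-map cell.
        cx = (col + 0.5) / fmap_w
        cy = (row + 0.5) / fmap_h
        for scale, ratio in product(scales, aspect_ratios):
            # ratio = width / height; the area stays scale**2.
            anchors.append((cx, cy, scale * sqrt(ratio), scale / sqrt(ratio)))
    return anchors


# Hypothetical 4x4 feature map with 2 scales x 2 aspect ratios per location.
anchors = make_anchors(fmap_h=4, fmap_w=4, scales=(0.1, 0.2), aspect_ratios=(1.0, 2.0))
```

With these illustrative settings each of the 16 locations carries four anchors, so the detector would predict offsets and class scores for 64 candidate boxes on this feature map.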
During training, each anchor is refined by four box coordinate offsets (width, height, x and y), optimized with a localization loss (a smooth L1 loss), and is assigned categorical probabilities (including background), optimized with a classification loss (categorical cross-entropy). To prevent the huge number of negative proposals from dominating the training gradients, hard negative mining is used, which fixes the ratio of foreground to background examples.
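The two ingredients of this training step can be illustrated with a short sketch. The smooth L1 function below uses the common form with a transition at |x| = 1, and the negative-to-positive ratio is a free parameter here; the default of 3 follows the usual SSD choice and is an assumption, since the value used for Jet-SSD is not stated. The helper names are hypothetical.

```python
def smooth_l1(x):
    """Smooth L1 loss: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def hard_negative_mask(cls_losses, is_positive, neg_pos_ratio=3):
    """Select anchors that contribute to the classification loss.

    All positive anchors are kept. Among the negatives (background), only
    the ones with the highest classification loss are kept, up to
    neg_pos_ratio negatives per positive (assumed ratio, see lead-in).
    """
    n_pos = sum(1 for pos in is_positive if pos)
    neg_idx = [i for i, pos in enumerate(is_positive) if not pos]
    neg_idx.sort(key=lambda i: cls_losses[i], reverse=True)
    keep = set(neg_idx[: neg_pos_ratio * n_pos])
    return [bool(pos) or (i in keep) for i, pos in enumerate(is_positive)]


# One positive anchor and five negatives; keep the two hardest negatives.
mask = hard_negative_mask(
    cls_losses=[0.1, 0.9, 0.2, 0.8, 0.05, 0.7],
    is_positive=[True, False, False, False, False, False],
    neg_pos_ratio=2,
)
```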
The SSD architecture is fully convolutional, with initial layers based on a pre-trained backbone architecture, such as VGG-16 [50], followed by extra convolutional layers, progressively decreasing in size. The information in the last layer may be too coarse spatially to allow precise localization and at the same time, detecting large objects in shallow layers is non-optimal without large enough receptive fields. SSD performs detection over multiple scales by operating on multiple feature maps, i.e. at different depths of the network. Each of these feature maps is responsible for detecting objects according to their receptive field.
The final prediction is made by merging all detection results from different feature maps followed by a non-maximum suppression (NMS) step to produce the final detection. NMS removes duplicate predictions originating from multiple anchors.
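A greedy NMS step of the kind described above can be sketched in a few lines. This is a generic illustration with boxes given as (x1, y1, x2, y2); the IoU threshold of 0.5 is an assumed default, not a value quoted for Jet-SSD, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, in order of decreasing score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first with an IoU of about 0.68, so it is treated as a duplicate and dropped, while the isolated third box survives.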

Efficient inference
Network compression [51] is a common technique to reduce the number of operations, model size, energy consumption, and over-training of deep neural networks. As neural network synapses and neurons can be redundant, compression techniques attempt to reduce the total number of them, effectively reducing multipliers. Several approaches have been successfully deployed without much loss in accuracy, including parameter pruning [52][53][54] (selective removal of parameters based on a particular ranking and regularization), low-rank factorisation [55][56][57] (using matrix decomposition to estimate informative parameters), compact network architectures [58][59][60][61], and knowledge distillation [62] (training a compact network with distilled knowledge of a large network).
A particularly successful compression technique is weight quantization [63][64][65][66][67][68][69][70][71], which reduces the precision of operations and operands. It has been observed that 32-bit floating-point, or full-precision (FP), calculations are not needed at inference to achieve optimal performance. Thus, reducing the precision of the weights and biases costs little in performance relative to the gains in speed and resource usage. Options include moving from floating point to fixed point, reducing the bit-width, and weight sharing. An example of a very aggressive strategy is reducing weight precision to ternary values restricted to {−1, 0, 1} only, called Ternary Weight Network (TWN) [68]. The quantization is performed during training, using a straight-through estimator [63], where ternary weights are used during the forward and backward propagation but not during the parameter update. To make the network perform well, TWNs minimize the Euclidean distance between the full-precision weights and the ternary ones with the use of a non-negative layer- and channel-dependent scaling factor α.
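As an illustration, the ternarization of one channel of weights could look like the sketch below. It assumes the approximate threshold rule usually associated with TWNs, Δ = 0.7 × mean(|w|), with α taken as the mean magnitude of the weights above threshold; whether Jet-SSD uses exactly this rule is not stated here, and the function name ternarize is hypothetical.

```python
def ternarize(weights, delta_factor=0.7):
    """Quantize one channel of weights to alpha * {-1, 0, 1}.

    Returns (ternary, alpha). The threshold delta = delta_factor * mean|w|
    is an assumed approximation, not a rule confirmed for Jet-SSD.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_factor * mean_abs
    # Small weights are zeroed, the rest keep only their sign.
    ternary = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    # alpha is the mean magnitude of the weights that survived the threshold.
    kept = [abs(w) for w, t in zip(weights, ternary) if t != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return ternary, alpha


# Applied once per output channel to obtain a channel-dependent alpha.
ternary, alpha = ternarize([0.9, -0.8, 0.05, -0.1, 0.4])
```

During training with a straight-through estimator, the forward and backward passes would use alpha times the ternary values, while the gradient update is applied to the full-precision copy of the weights.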

Dataset
The CERN LHC experiments implement a real-time selection process, called trigger [72], to store a fraction of the events for further analysis. Jets are useful for many measurements and physics searches. A truly minimal approach to perform identification and tagging is with jet images. Generally, jets need a component of tracks as well to be properly reconstructed.
However, one could reconstruct the calorimeter part alone (known as CaloJet). The energy measurements of the emanating particles can be projected onto a cylindrical detector and represented as images by unfolding the inner surface of the calorimeter on a rectangle, and using the crystals as pixels, as in [73].
The detector effects and hadronization have an important effect on the jet substructure. In this work, we use an emulation of the Compact Muon Solenoid (CMS) apparatus as a reference. There are two calorimeters within the solenoid volume of the CMS detector. A lead tungstate crystal Electromagnetic Calorimeter (ECAL) is designed to stop particles whose main interaction is electromagnetic (photons, electrons). A brass and scintillator Hadronic Calorimeter (HCAL) is designed to stop hadrons. Together they measure the energy of both charged and neutral particles. Each of them is composed of a barrel and two endcap sections. Forward calorimeters extend the pseudorapidity (η) coverage provided by the barrel (|η| ≤ 1.4) and endcap detectors (1.4 < |η| ≤ 3.0). A more detailed description of the CMS detector, together with a definition of the coordinate system used and the relevant kinematic variables, can be found in [74].
This study aims at identifying different kinds of jets. To this end, we consider 13 TeV proton-proton collision events, in which RS gravitons decay to bb, HH, WW, ZZ, or tt final states. Events are generated with Pythia [75] and the CMS detector effects are emulated using the Delphes [76] library. In addition to the hard collision, parasitic pileup collisions are simulated by overlaying minimum bias events. The number of pileup collisions is sampled from a Poisson distribution. The calorimeter cells (towers) in the barrel region are arranged in a fixed discrete space with fine segmentation in η, φ, where φ is the translated azimuthal angle. The final image is formed by translating the calorimeter energy deposits into pixels, which results in a 340 × 360 pixel image. The intensity of each pixel is proportional to the sum of the energy of the corresponding cell. Previous studies on jet images implemented data pre-processing steps such as translation, rotation, re-pixelation, or inversion. In our study, however, we only limit the input to the barrel and endcap sections, η ∈ (−3, 3), and normalize pixel intensities to the fixed range [0, 1] using maximum scaling. The ground truth labels for jets above a momentum threshold (30 GeV/c for b jets and 200 GeV/c for jets from boosted heavy particles) are obtained using a simple cone algorithm, i.e. associating together particles whose trajectories lie within a circle of radius R = 0.4 from the jet centre.
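The cone association and the intensity normalization can be sketched as below. The sketch assumes that the distance from the jet centre is the usual ΔR = sqrt(Δη² + Δφ²) with Δφ wrapped around the cylinder, and that maximum scaling divides each image by its own largest pixel value; both are assumptions about details not spelled out in the text, and the function names are hypothetical.

```python
from math import hypot, pi


def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped to [-pi, pi)."""
    return (phi1 - phi2 + pi) % (2 * pi) - pi


def in_cone(eta, phi, jet_eta, jet_phi, radius=0.4):
    """True if a particle lies within a cone of the given radius around the jet centre."""
    return hypot(eta - jet_eta, delta_phi(phi, jet_phi)) < radius


def max_scale(image):
    """Scale pixel intensities to [0, 1] by dividing by the image maximum (assumed per image)."""
    peak = max(max(row) for row in image)
    if peak <= 0:
        return image
    return [[value / peak for value in row] for row in image]


# A particle just across the phi boundary still belongs to the jet.
near_boundary = in_cone(eta=0.0, phi=3.1, jet_eta=0.1, jet_phi=-3.1)
scaled = max_scale([[0.0, 2.0], [4.0, 1.0]])
```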
As a proof of concept, we investigate the tagging of bottom quark (b), W boson (W), Higgs boson (H), and top quark (t) jets. An example input, with energy deposits translated to two-dimensional images with two channels (corresponding to ECAL and HCAL) and with the ground truth bounding boxes marked, is shown in Figure 2.

Model, implementation and training procedure
The Jet-SSD architecture is shown in Figure 3. Several modifications are applied to the original architecture [49]. Due to target hardware constraints, all filters in convolution layers are of size 3×3 with no dilation and all pooling layers have 2×2 filters. Each convolution block is followed by batch normalization [77,78] and parametric rectified linear unit (PReLU) layers. To compress the model we use half of the channels of VGG-16 in each layer. We also remove the bias from all convolution layers. The extra layers proposed by the original paper do not contribute to accurate detection due to the size of jets and are thus removed already during training. Retaining the deeper layers in the base network does not improve the final detection results either, but they are critical during training because they provide additional signal during back-propagation. Hence, we only purge them at inference.
The Jet-SSD network is implemented on an NVIDIA Tesla GPU using PyTorch [79]. For training, we use stochastic gradient descent with an initial learning rate of 10⁻³, with momentum set to 0.9 and weight regularization to 0.0005. We train the network for 100 epochs with a batch size of 25, decreasing the learning rate by a factor of 2 after 20, 30, 50, 60, 70, 80 and 90 epochs. We use 90k and 30k samples for training and validation, respectively. The training is performed in mixed precision to speed up computation and is distributed across 3 GPUs.
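The learning-rate schedule quoted above can be written out explicitly. The sketch below is plain Python and assumes zero-based epoch indices, so that "after 20 epochs" means from epoch index 20 onwards; the function name learning_rate is hypothetical.

```python
from bisect import bisect_right

# Epochs after which the learning rate is halved, as listed in the text.
MILESTONES = (20, 30, 50, 60, 70, 80, 90)


def learning_rate(epoch, base_lr=1e-3, gamma=0.5, milestones=MILESTONES):
    """Learning rate at a zero-based epoch index (indexing is an assumption)."""
    # Number of milestones already passed determines how many halvings apply.
    return base_lr * gamma ** bisect_right(milestones, epoch)


schedule = [learning_rate(epoch) for epoch in range(100)]
```

By the last epoch the rate has been halved seven times, reaching 10⁻³ / 128 ≈ 7.8 × 10⁻⁶. In PyTorch an equivalent schedule can be expressed with torch.optim.lr_scheduler.MultiStepLR using the same milestones and gamma = 0.5, although the text does not say which scheduler was used.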
The full precision network (FPN) is trained from scratch using Xavier uniform initialization [80] (which helps with the sparsity of the input), as models pre-trained for classification on the real-world ImageNet [81] dataset have little relation to our calorimeter images. A common challenge when training models from scratch is an insufficient amount of training data, which may lead to overfitting. However, this is not a problem in our case: the training dataset is large enough and, if overfitting occurred, we could generate an even larger one. For TWN training we find that pre-loading the trained FPN weights greatly speeds up the process, and that a per-layer and per-channel scaling factor α improves the results.
The final detection layer returns a classification label (background, b, W/H or t jet) and three regression values. Two of them correspond to the centre of the jet, i.e. its offsets from the anchor in the (η, φ) plane. The last one is the jet mass, an example of an auxiliary regression task that Jet-SSD can be given.

Results
An example of Jet-SSD in action is shown in Figure 4. Jet-SSD outputs the predicted categorical label of each object, a confidence score and a bounding box. In object detection, a true positive is defined as a prediction whose category equals the ground truth label and whose Intersection over Union (IoU) with the ground truth box is above a predefined threshold, in our case 0.5. A successful prediction meets both criteria; a prediction that fails either one is counted as a false positive, and a ground truth object left without a matching prediction is counted as a false negative.
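The matching criterion can be made concrete with a small sketch that counts true positives, false positives and false negatives for one image. It follows the usual detection convention (each ground truth object can be matched at most once, predictions are visited in order of decreasing confidence), which is an assumption about the evaluation details; the function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds, truths, iou_threshold=0.5):
    """Count (tp, fp, fn).

    preds:  list of (box, label, confidence)
    truths: list of (box, label)
    """
    matched = set()
    tp = fp = 0
    for box, label, _ in sorted(preds, key=lambda p: p[2], reverse=True):
        best, best_iou = None, iou_threshold
        for k, (gt_box, gt_label) in enumerate(truths):
            if k in matched or gt_label != label:
                continue  # already used, or wrong category
            overlap = box_iou(box, gt_box)
            if overlap > best_iou:
                best, best_iou = k, overlap
        if best is None:
            fp += 1  # wrong label or insufficient overlap
        else:
            matched.add(best)
            tp += 1
    return tp, fp, len(truths) - len(matched)


truths = [((0, 0, 10, 10), "t"), ((20, 20, 30, 30), "b")]
preds = [((1, 1, 11, 11), "t", 0.9), ((0, 0, 10, 10), "W", 0.8), ((50, 50, 60, 60), "b", 0.7)]
counts = match_detections(preds, truths)
```

In this toy example the t prediction overlaps its target with an IoU of about 0.68 and is a true positive, the W prediction has the wrong label and the displaced b prediction has no overlap, so both are false positives, and the undetected b jet is a false negative.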
To evaluate the model we use precision, recall (true positive rate), and the average precision (AP) metric, which is computed for each category separately. Classification tasks usually report the receiver operating characteristic (ROC) curve, which shows the false positive rate (fall-out or background efficiency) as a function of the true positive rate (sensitivity or signal efficiency). In the case of object detection, the false positive rate is not very informative as there is a large imbalance between the positive and negative classes (there are no objects in most locations). Thus, the false positive rate is replaced by precision, or positive predictive value (PPV). Intuitively, precision measures how accurate the predictions are, while recall measures how many of the true objects are found. To draw a precision-recall (PR) curve, the predictions are first sorted in order of confidence, and the positive predictive value and true positive rate are then calculated for each confidence threshold. For the relationship between ROC and PR curves, see [82].
The PR curve of Jet-SSD, evaluated on a held-out test dataset consisting of 90k samples, is shown in Figure 5. The TWN results closely match those of the FPN, which is reflected in the AP scores. To calculate AP, the maximum precision is computed at recall values ranging from 0 to 1 in steps of 0.1, and the results are averaged. From the PR curve, we can conclude that t jets are the easiest to identify while b jet detection is lacking. The result is not surprising for two reasons. Firstly, b jets have a lower momentum threshold, making the energy deposits more challenging to detect. Secondly, CNN-based object detection becomes more challenging as the scale of the target object decreases, and b jets have a smaller radius than t, W and H jets. The latter issue can be further mitigated, as small-scale object detection is an active research field in machine learning (for example [83]).
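The AP computation can be sketched as the standard 11-point interpolation. The sketch assumes the common convention that, at each recall level, the maximum precision is taken over all points with recall greater than or equal to that level, and that levels never reached contribute zero; the text does not spell out these details, and the function name average_precision is hypothetical.

```python
def average_precision(recalls, precisions):
    """11-point interpolated AP from matching lists of recall and precision values."""
    total = 0.0
    for k in range(11):
        level = k / 10  # recall levels 0.0, 0.1, ..., 1.0
        # Maximum precision among operating points reaching this recall level.
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11


# Toy PR points, one per confidence threshold (illustrative numbers only).
recalls = [0.2, 0.4, 0.4, 0.6, 0.8]
precisions = [1.0, 1.0, 0.67, 0.75, 0.8]
ap = average_precision(recalls, precisions)
```

For these toy points the interpolated precision is 1.0 for the five levels up to 0.4, 0.8 for the four levels from 0.5 to 0.8, and 0 for the two levels that are never reached, giving AP = 8.2 / 11.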
Finally, we report the mean and median localization error in φ and η and the relative error in mass regression. These results are shown in Figure 6. The localization error is smaller in φ than in η because of input information loss in η. Recall that we limit the input in the η dimension. When the jet centre is close to the edge, i.e. |η| ≈ 3.0, part of the information is lost beyond the image boundaries. Due to the cylindrical structure of the detector, this does not happen in the φ dimension. Furthermore, we notice that the error in η does not decrease with pT, for which we do not find a reason. Finally, the relative error of the mass regression can be further decreased by re-balancing the SSD training loss, i.e. increasing the contribution of the regression error to back-propagation by introducing a new scaling hyper-parameter β: loss = classification + localization + β × auxiliary, where β > 1.

Conclusions
In this paper, we introduced Jet-SSD, a deep learning network able to simultaneously localize, tag and estimate the mass of jets, collimated sprays of particles produced in high energy physics experiments. We showed that the model compressed by quantizing weights to ternary values with a layer- and channel-dependent scaling factor closely matches the performance of the full precision model. As a next step, we plan to examine the performance of the network on dedicated hardware.