Conditional Wasserstein Generative Adversarial Networks for Fast Detector Simulation

Detector simulation in high energy physics experiments is a key yet computationally expensive step in the event simulation process. There has been much recent interest in using deep generative models as a faster alternative to the full Monte Carlo simulation process in situations in which the utmost accuracy is not necessary. In this work we investigate the use of conditional Wasserstein Generative Adversarial Networks to simulate both hadronization and the detector response to jets. Our model takes the 4-momenta of jets formed from partons post-showering and pre-hadronization as inputs and predicts the 4-momenta of the corresponding reconstructed jet. Our model is trained on fully simulated tt̄ events using the publicly available GEANT-based simulation of the CMS Collaboration. We demonstrate that the model produces accurate conditional reconstructed jet transverse momentum (pT) distributions over a wide range of pT for the input parton jet. Our model takes only a fraction of the time necessary for conventional detector simulation methods, running on a CPU in less than a millisecond per event.


Introduction
Comparing theoretical predictions with experimental results in high energy particle collisions such as those produced at the LHC is a challenging problem. In addition to the complicated calculations that go into predicting the final states of the collisions, it is necessary to either simulate the response of the detector to the final state particles and apply event reconstruction algorithms to the simulated data, or go in the inverse direction and unfold the detector effects from experimental results. When the first approach is taken and high accuracy is needed, studies use Monte Carlo (MC)-based detector simulators such as GEANT4 [1]. Unfortunately, this accuracy comes at a significant computational cost, and processing a single LHC event can take on the order of minutes. Several publicly available fast simulators such as Delphes [2] have been introduced as quicker, lower-fidelity alternatives to the full MC simulations. Such fast simulators parameterize the detector response function R_D and then randomly sample from R_D for each particle in the event.
In this work, we take advantage of recent advances in deep generative modeling to learn R_D using conditional Wasserstein Generative Adversarial Networks (cWGAN). We focus on the simulation of jets, collimated clusters of particles arising from the hadronization of quarks and gluons. There now exists a significant body of work on many aspects of jet simulation. In particular, many studies have simulated the calorimeter response using Generative Adversarial Networks (GANs) [3][4][5], auto-regressive models [6], autoencoders [7], and graph neural networks [8]. Other works have used GANs to produce jet images [9] and predict kinematic observables of dijet events [10]. Our work differs from previous studies in that we condition our model on jets formed at the parton level, post-parton showering and pre-hadronization, and the model directly predicts the 4-momenta of reconstructed jets. This choice of input is intended to speed up the simulation of the full event, since the event generator can stop before hadronization. We refer to this approach and project as Falcon.
The paper is organized as follows. We first describe details of the simulated data we have used and follow with an overview of cWGANs. Details of our model are presented next, followed by our results and conclusions.

Simulated Data
In order to train our model, we generated roughly 168,000 tt̄ events at 13 TeV without pileup using a standard CMS Open Data [11,12] workflow in the publicly available software library CMSSW [13] of the CMS Collaboration. Event generation was performed with Pythia 8 [14] and the simulation of the CMS detector was done with GEANT4. Events were reconstructed using the Particle Flow (PF) reconstruction algorithm [15] and reconstructed jets were clustered from PF candidates using the anti-kT algorithm [16] with distance parameter R = 0.4.
Since our goal is to map directly from the parton jets to reconstructed jets, an algorithm is needed to identify the appropriate set of partons prior to hadronization. These partons were identified as follows. Starting from an empty set of partons, for each stable generated particle we recursively traverse through each mother particle, checking whether the particle encountered is a parton; each unique parton found is added to the set. The set of partons so obtained was then clustered into jets using the anti-kT algorithm implemented in the FastJet software package [17], again with distance parameter R = 0.4. In each event, a parton jet and a reconstructed jet were matched if ∆R = √(∆η² + ∆φ²) < 0.35, where ∆η is the difference in pseudo-rapidity between the two jets and ∆φ is the difference in azimuthal angle. Only parton jets with pT > 20 GeV were considered for matching. This procedure resulted in 928,991 parton jet-reconstructed jet pairs that were used to train the model.
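As an illustration of the matching step, a minimal Python sketch is given below. The jet containers, field names, and the greedy one-to-one assignment are assumptions made for illustration; the actual selection was performed inside the CMSSW/FastJet workflow described above.

```python
import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance between two jets, with delta(phi) wrapped to [-pi, pi]."""
    deta = eta1 - eta2
    dphi = np.mod(phi1 - phi2 + np.pi, 2.0 * np.pi) - np.pi
    return np.sqrt(deta**2 + dphi**2)

def match_jets(parton_jets, reco_jets, max_dr=0.35, min_parton_pt=20.0):
    """Pair each parton jet (dict with pt/eta/phi) with the closest reco jet
    within max_dr; greedy one-to-one assignment is an illustrative choice."""
    pairs, used = [], set()
    for pj in parton_jets:
        if pj["pt"] < min_parton_pt:
            continue  # only parton jets with pT > 20 GeV are considered
        best_idx, best_dr = None, max_dr
        for i, rj in enumerate(reco_jets):
            if i in used:
                continue
            dr = delta_r(pj["eta"], pj["phi"], rj["eta"], rj["phi"])
            if dr < best_dr:
                best_idx, best_dr = i, dr
        if best_idx is not None:
            used.add(best_idx)
            pairs.append((pj, reco_jets[best_idx]))
    return pairs
```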

Conditional Wasserstein Generative Adversarial Networks
The goal of our machine learning model is to generate samples x ∼ p(x|y), where x is the 4-momentum of a reconstructed jet given the 4-momentum y of the associated parton jet from the event generator. We use a cWGAN, which adds two extensions to the original Generative Adversarial Network (GAN) architecture proposed in [18]. A standard GAN consists of two neural networks: a generator G samples from a noise distribution p_z(z) and produces an output x according to the generated distribution p_g(x), while a discriminator D takes in samples from the generated distribution as well as samples from the true distribution p_r(x) and determines from which distribution a given sample originated. During training, D "learns" to differentiate between the generated and true distributions, while G learns to "fool" D by producing samples that are as realistic as possible. It is shown in [18] that, given a perfect discriminator, training a GAN is equivalent to minimizing the Jensen-Shannon divergence between p_g and p_r.
The first extension, the Wasserstein GAN introduced in [19], instead minimizes the Wasserstein distance between p_g and p_r, given by

W(p_r, p_g) = max_{w∈W} E_{x∼p_r}[D_w(x)] − E_{z∼p_z(z)}[D_w(G_θ(z))],    (1)

where the generator is parameterized by θ, the discriminator is parameterized by w, and W is the set of parameters for which D is a K-Lipschitz function for some K. In practice, a Wasserstein GAN is trained by alternating between several iterations of gradient ascent on D, using the objective of equation (1) to estimate the Wasserstein distance, and a step of gradient descent on G. In order to enforce the Lipschitz constraint, we used an additional "gradient penalty" term in the loss function as described in [20]. The discriminator in a Wasserstein GAN is termed the "critic" to emphasize that it no longer serves as a classifier.
The second extension is that of a conditional GAN [21]. A conditional GAN seeks to learn a conditional distribution p_r(x|y), where y is some additional information such as a class label or, in our case, the 4-momentum of a parton jet. To do so, the generator accepts a noise vector z and the additional information y and produces a sample x. When trained with the Wasserstein distance, the critic takes (x, y) pairs either from the generator or from the true distribution. This results in the loss function

L = E_{(x,y)∼p_r}[D_w(x, y)] − E_{z∼p_z(z)}[D_w(G_θ(z, y), y)] − λ E_{(x̂,ŷ)}[(‖∇_{(x̂,ŷ)} D_w(x̂, ŷ)‖₂ − 1)²],    (2)

where the last term is the gradient penalty. To calculate the gradient penalty, the gradient of the critic is calculated with respect to inputs x̂ and ŷ, which are convex combinations of data points from the true and generated distributions. The coefficient λ controls how strongly the gradient penalty is enforced. The critic performs gradient ascent on equation (2), while the generator performs gradient descent on its second term, the only term that depends on θ.
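As an illustration, the gradient penalty term of equation (2) can be computed in TensorFlow roughly as follows. This is a minimal sketch, assuming a Keras critic that takes a pair of inputs as critic([x, y]); it is not taken verbatim from the Falcon repository [24].

```python
import tensorflow as tf

def gradient_penalty(critic, x_real, x_fake, y_real, y_fake):
    """Last term of equation (2): penalize the critic gradient norm at
    convex combinations (x_hat, y_hat) of real and generated samples."""
    eps = tf.random.uniform([tf.shape(x_real)[0], 1], 0.0, 1.0)
    x_hat = eps * x_real + (1.0 - eps) * x_fake
    y_hat = eps * y_real + (1.0 - eps) * y_fake
    with tf.GradientTape() as tape:
        tape.watch([x_hat, y_hat])
        d_hat = critic([x_hat, y_hat])
    grad_x, grad_y = tape.gradient(d_hat, [x_hat, y_hat])
    grads = tf.concat([grad_x, grad_y], axis=-1)
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=-1) + 1e-12)
    return tf.reduce_mean(tf.square(norm - 1.0))
```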

Implementation Details
The cWGAN was implemented using the TensorFlow package [22] and the Keras interface [23]. All of the code used for training the model is available in a public GitHub repository [24]. Before the data were provided to the model, the features pT and energy E were scaled by taking the base-10 logarithm and then all features were normalized to have a mean of zero and unit variance. Both the generator and critic are fully connected neural networks. A diagram showing the layers, nodes, and activation functions of each network is shown in figure 1.
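The preprocessing can be sketched as follows; the (pT, η, φ, E) column ordering is an assumption made for illustration.

```python
import numpy as np

def preprocess(jets):
    """Scale pT and E by log10, then normalize every feature to zero mean
    and unit variance. `jets` is an (N, 4) array of (pT, eta, phi, E)."""
    x = jets.astype(np.float64).copy()
    x[:, 0] = np.log10(x[:, 0])  # pT -> log10(pT)
    x[:, 3] = np.log10(x[:, 3])  # E  -> log10(E)
    mean, std = x.mean(axis=0), x.std(axis=0)
    return (x - mean) / std, mean, std  # keep mean/std to invert the scaling
```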
During training, batches are formed by randomly sampling 64 elements from the dataset of matched parton jet and reconstructed jet 4-momenta, as well as 64 "noise vectors" of length 10 drawn from a uniform distribution over the interval [0, 1]. In a generator update step, the parton jet 4-momenta and noise vectors are passed into the generator, which produces a batch of generated reconstructed jet 4-momenta. These generated 4-momenta and their associated parton jet 4-momenta are passed into the critic, and the output of the critic is used to calculate the second term of equation (2), as it is the only term that depends on the generator. The gradient update is then performed using the RMSProp algorithm [25]. For the critic update step, two batches of size 64 are sampled from the jet dataset, along with a batch of noise vectors. The noise vectors and one of the batches of parton jet 4-momenta are passed into the generator to create a batch of data from the generated distribution. Convex combinations are formed from the generated batch and the real batch, and then the real batch, the generated batch, and the combination batch are passed into the critic in order to calculate the loss given in equation (2). This gradient update is also performed with the RMSProp algorithm. A diagram of the flow of data inside the cWGAN is shown in figure 2. We used a ratio of five critic updates for every generator update, with a learning rate of 5 × 10⁻⁵ for the critic and 1 × 10⁻⁵ for the generator. We set the gradient penalty coefficient λ to 10.
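A condensed sketch of the two update steps under these settings is shown below, reusing the gradient_penalty helper from the previous section. The two-input model signatures generator([y, z]) and critic([x, y]) are assumptions for illustration; the full implementation is available in the repository [24].

```python
import tensorflow as tf

BATCH, NOISE_DIM, N_CRITIC, LAMBDA = 64, 10, 5, 10.0
critic_opt = tf.keras.optimizers.RMSprop(learning_rate=5e-5)
gen_opt = tf.keras.optimizers.RMSprop(learning_rate=1e-5)

def critic_step(generator, critic, x_real, y_real, y_other):
    """One critic update: ascend equation (2), i.e. descend its negation."""
    z = tf.random.uniform([BATCH, NOISE_DIM])  # noise ~ U[0, 1]
    with tf.GradientTape() as tape:
        x_fake = generator([y_other, z])
        loss = (tf.reduce_mean(critic([x_fake, y_other]))
                - tf.reduce_mean(critic([x_real, y_real]))
                + LAMBDA * gradient_penalty(critic, x_real, x_fake,
                                            y_real, y_other))
    grads = tape.gradient(loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

def generator_step(generator, critic, y):
    """One generator update: only the second term of equation (2) is needed."""
    z = tf.random.uniform([BATCH, NOISE_DIM])
    with tf.GradientTape() as tape:
        loss = -tf.reduce_mean(critic([generator([y, z]), y]))
    grads = tape.gradient(loss, generator.trainable_variables)
    gen_opt.apply_gradients(zip(grads, generator.trainable_variables))

# Training alternates N_CRITIC calls to critic_step with one generator_step.
```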

Results
In this section we refer to the reconstructed jets obtained from the GEANT4 simulation as the "true reco" jets, and the jets from the cWGAN as the "predicted reco" jets.

Timing
In this section we present results on the time taken to simulate events using our machine learning approach and compare them with the time taken using Delphes as an example of a fast detector simulator. All times were measured on a Linux CentOS 8 machine with thirty-two Intel Xeon E5-2620 v4 CPUs running at 2.1 GHz. We first investigate the amount of time saved by skipping the hadronization step. Table 1 shows the time taken to generate events in Pythia 8 with and without the hadronization step, at a center-of-mass energy of 13 TeV with the hard subprocess gg → tt̄. As seen in the table, once Pythia 8 is used to generate thousands of events, hadronization appears to take around 5–15% of the total event generation time. We next compare the speed of our machine learning model with that of Delphes. To obtain times for the cWGAN, we generated events in Pythia 8 without hadronization, clustered the partons in each event into jets using the anti-kT algorithm as implemented in FastJet with distance parameter R = 0.4, and then passed the jets into the cWGAN. We called our TensorFlow model directly from our C++ code using the CppFlow library [26]. To obtain the times for Delphes, we ran Pythia 8 inside Delphes using the "DelphesPythia8" executable offered as part of the Delphes software.
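As an illustration of how the hadronization-on and hadronization-off samples can be timed, a standalone sketch using the Pythia 8 Python bindings is given below. The paper's actual measurements used the CMSSW and Delphes workflows described above; the settings here are a minimal stand-in, not the production configuration.

```python
import time
import pythia8  # Pythia 8 Python bindings

def time_generation(n_events, hadronize):
    """Time the generation of gg -> ttbar events at 13 TeV,
    optionally stopping before the hadronization step."""
    pythia = pythia8.Pythia()
    pythia.readString("Beams:eCM = 13000.")   # 13 TeV center-of-mass energy
    pythia.readString("Top:gg2ttbar = on")    # hard subprocess gg -> ttbar
    if not hadronize:
        pythia.readString("HadronLevel:all = off")  # stop after showering
    pythia.init()
    start = time.time()
    generated = 0
    while generated < n_events:
        if pythia.next():  # skip events that fail to generate
            generated += 1
    return time.time() - start

print("with hadronization:   ", time_generation(1000, True))
print("without hadronization:", time_generation(1000, False))
```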

As a first comparison, Table 2 shows the amount of time taken to do the entire event simulation process using either the cWGAN or Delphes as the detector simulator. We observed that using the cWGAN reduced the entire event simulation time by over half.
We next compared the time taken by the cWGAN and by Delphes beyond that taken by the event generator. To do this, we took the times listed in Table 2 and subtracted the time needed to generate the events through hadronization with Pythia 8, cluster the stable particles into jets using the anti-kT algorithm as implemented in FastJet with distance parameter R = 0.4, and write the jet 4-momenta to disk. This comparison is shown in Table 3. When compared in this fashion, we observed that the cWGAN was over an order of magnitude faster than Delphes.

Table 3. Comparison of the CPU time used, after event generation with Pythia 8, to simulate events using the cWGAN and Delphes.

Momenta Predictions
For parton jet 4-momenta y and reco jet 4-momenta x, the matching procedure creates a joint distribution of 4-momenta p(x, y). Figure 3 shows the marginal distribution p_r(x) = ∫ p_r(x, y) dy from the training dataset and the marginal distribution predicted by the model, p_g(x) = ∫ p_g(x, y) dy. As seen in figure 3, there are certain characteristics of the distributions that the model does not capture, such as the dip at η ≈ ±3 (a result of the CMS detector geometry). However, the model does learn the overall shapes of the marginal distributions. Figure 4 shows the full joint distribution p(x, y) of pT for matched parton and reconstructed jets, where the counts of each histogram bin are log-scaled. It is apparent that the model is able to learn many important features of the joint distribution, including non-Gaussian effects for low-pT jets.
The true goal of the model is to sample from the conditional distribution x ∼ p_g(x|y). To evaluate the model's ability to produce accurate conditional distributions, we defined pT bins around a central value a and selected every reco jet matched to a parton jet with pT ∈ [a − δ, a + δ] for a given bin half-width δ. As δ becomes smaller, these selections approach the conditional distribution at parton jet pT = a. Since we are approximating the density by drawing samples from the joint distribution, the size of our dataset limits how small δ can be made. Several examples of such "conditional" distributions are shown in figure 5. The model shows the ability to predict accurate distributions conditioned on a range of parton jet pT values. We note that even in the high parton jet pT regime, where there is little training data, the model still produces reasonable conditional distributions, as seen in the bottom right of figure 5.
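The binning procedure can be sketched as follows; the array names are illustrative assumptions.

```python
import numpy as np

def conditional_slice(parton_pt, reco_pt, a, delta):
    """Approximate p(reco pT | parton pT = a) by selecting matched pairs
    whose parton jet pT lies in [a - delta, a + delta]."""
    mask = np.abs(parton_pt - a) <= delta
    return reco_pt[mask]

# Example: compare true and cWGAN-predicted slices around a = 100 GeV.
# true_slice = conditional_slice(parton_pt, true_reco_pt, 100.0, 5.0)
# pred_slice = conditional_slice(parton_pt, pred_reco_pt, 100.0, 5.0)
```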

Conclusions
In this study, we make use of advances in deep generative models and demonstrate the utility of cWGANs as fast detector simulators for jets. The cWGAN is capable of producing realistic conditional distributions of reconstructed jet pT in a time that is orders of magnitude shorter than that of the conventional detector simulation process. Our model also presents an advantage over existing fast detector simulators in that it takes as input the partons in an event before hadronization. This saves time in the overall event simulation, as event generators do not need to be run all the way through hadronization. An important next step in the development of cWGANs for detector simulation is the addition of pileup interactions to the training data. We chose not to include pileup interactions in our simulated events in this preliminary study, but this is a natural avenue for further investigation.