Decoding Photons: Physics in the Latent Space of a BIB-AE Generative Network

Given the increasing data collection capabilities and limited computing resources of future collider experiments, interest in using generative neural networks for the fast simulation of collider events is growing. In our previous study, the Bounded Information Bottleneck Autoencoder (BIB-AE) architecture for generating photon showers in a high-granularity calorimeter achieved high-fidelity modeling of various global differential shower distributions. In this work, we investigate how the BIB-AE encodes this physics information in its latent space. Understanding this encoding allows us to propose methods to further optimize generation performance, for example by altering the latent space sampling or by suggesting specific hyperparameter changes. In particular, we improve the modeling of the shower shape along the particle incident axis.


Introduction
High-quality simulations of fundamental processes and particle interactions with complex detectors are crucial to data analysis in high energy physics. Especially in the context of increasing data volumes from upcoming runs of the Large Hadron Collider (LHC) and future experiments, the production of datasets using Monte-Carlo-based simulators is increasingly becoming a computing bottleneck [1].
A way to accelerate simulations is based on generative machine learning models that leverage recent advances in computer vision and can be run highly parallel on graphics processing hardware. Such fast simulations based on Generative Adversarial Networks (GANs) [2] for calorimeter physics were first introduced in Ref. [3] and have seen active development in recent years [4][5][6][7][8][9][10][11]. This approach starts with a small dataset obtained using classical simulation techniques and aims to amplify its usable statistics by training a generative model. The principal feasibility of such amplification was shown in Ref. [12].
Inspired by Ref. [13], we have previously implemented an improved Bounded Information Bottleneck Autoencoder (BIB-AE) architecture and shown its generation accuracy for various differential distributions of photon shower data in a high-granularity calorimeter [14]. The BIB-AE architecture unifies ideas from different generative approaches, including GANs and Variational Autoencoders (VAEs) [15]. As an autoencoder, the model encodes input photon showers into a latent space from which newly generated showers are in turn sampled. The information bottleneck (IB) [16] refers in this context to the principle that the model optimizes the latent encoding while maximizing the mutual information between input and output. This contribution explores methods to understand the physics encoded in the latent space and introduces optimizations for improved generation fidelity. In contrast to Ref. [17], we do not aim to explicitly shape the latent space to match physical distributions, but rather investigate how the deviations and correlations of the ideally Standard Normal latent space features correspond to physically important observables. Compared to Ref. [18], we focus on an information-theoretic perspective and investigate correlations with physical observables instead of the topological structure of the latent space.
In the following, we first briefly introduce the data (Sec. 1.1) and our BIB-AE architecture (Sec. 1.2). We then investigate the connection between generative performance and information encoded in the latent space in Sec. 2, the correlation between learned latent space distributions and physical observables in Sec. 3, and see how this can be used to improve generative performance in Sec. 4. We close with a summary of results and draw our conclusions in Sec. 5.

Photon showers in a high granularity calorimeter
Calorimeters are an essential part of detectors used at high-energy particle colliders. They measure the energy that particles deposit when interacting with the detector material. Particles interacting with the matter in the calorimeter can produce secondary particles, resulting in cascades or showers. Such a particle shower is initiated, for example, by an incoming photon interacting electromagnetically.
Modern sampling calorimeters are built in a sandwich structure of measuring active layers interspersed with dense passive material. The active material of modern high granularity calorimeters consists of many small cells that are read out separately, yielding high resolution 3-dimensional measurements of particle showers.
We created our photon shower dataset using the Geant4 [19] toolkit and a simulation of the SiW electromagnetic calorimeter in the International Large Detector (ILD) concept [20]. The simulated calorimeter section comprises 30 active layers, each with 900 calorimeter cells of 5×5 mm² arranged in a rectangular grid, resulting in 3d images of 30×30×30 = 27,000 pixels. Our dataset consists of 950k photon showers with incident energies uniformly distributed between 10 and 100 GeV. This is the same dataset as used in Ref. [14], and we refer to that publication for additional details.

The BIB-AE model
The BIB-AE architecture consists of several building blocks: An encoder network mapping the input calorimeter images into a latent representation; a decoder network transforming the latent representation back into calorimeter images; a Post-Processor network refining the pixel values of the decoded image; a reconstruction critic network calculating the Wasserstein-distance between encoded and decoded image; and a latent critic network to regularize the latent space. The whole model is trained in two stages: First, the encoder, decoder, and critics are trained until sufficient fidelity is reached; afterwards, the whole model is trained in conjunction with the Post-Processor network to improve the accuracy of the cell energy generation. To generate energy dependent samples, the BIB-AE is conditioned on the incident particle energy. An overview of the architecture is shown in Figure 1. A more detailed discussion of the network is provided in [14].
Like any VAE-based model, the trained BIB-AE can be used to generate calorimeter shower images by sampling the latent space from Standard Normal distributions. To achieve good generation results, the latent space needs to be regularized towards such a Normal distribution. For this regularization the BIB-AE model employs several loss terms during training: a Kullback-Leibler divergence (KLD) loss L_KLD, the output of a latent critic network L_latent-critic, and a latent Maximum Mean Discrepancy (MMD) [21] term L_latent-MMD. Each latent regularizer contribution is scaled with a weight β, yielding a combined latent loss of

L_total-latent = β_KLD · L_KLD + β_latent-critic · L_latent-critic + β_latent-MMD · L_latent-MMD, (1)

with the KL divergence of two discrete probability distributions P and Q defined as

D_KL(P ‖ Q) = Σ_x P(x) · log( P(x) / Q(x) ) (2)

and, for a single Gaussian latent variable regularized towards a Standard Normal distribution, calculated via

D_KL,i = 1/2 · ( µ_i² + σ_i² − 1 − log σ_i² ). (3)

In the context of this publication, latent variables Z_i are Gaussian distributions with two trainable parameters µ_i and σ_i (Z_i ≡ N(µ_i, σ_i²)) regularized towards a Standard Normal distribution. Their sampled values are z_i ∼ Z_i.
In previous work we have implemented the BIB-AE architecture with 24 trainable latent variables and an additional 488 variables that are not encoded but sampled directly from a Standard Normal distribution during training [14]. We term the number of trainable latent space variables the latent space size n. The total KLD loss is hence given by L_KLD = Σ_{i=1}^{n} D_KL,i. The loss weight β_KLD has the highest impact on the latent regularization, as its scaling defines the magnitude of the KL divergence. Here the KL divergence measures the information content of the latent space [22,23].
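For Gaussian latent variables regularized towards a Standard Normal distribution, the per-variable KL divergence of Eq. 3 has a simple closed form. A minimal sketch (the function name and the NumPy-based interface are illustrative, not the actual training code):

```python
import numpy as np

def gaussian_kld(mu, sigma):
    """Closed-form KL divergence D_KL( N(mu, sigma^2) || N(0, 1) ) in nats
    for a single latent variable with trainable parameters mu and sigma."""
    return 0.5 * (mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

# A latent variable that matches the Standard Normal carries no information:
print(gaussian_kld(0.0, 1.0))  # 0.0 nats
```

Summing this quantity over all n trainable latent variables gives the total L_KLD entering the combined latent loss.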

Different latent space sizes
Intuitively, for fixed β_KLD, higher latent space sizes n should yield increased total information in the latent space until a maximum corresponding to the showers' intrinsic relevant information is reached. We test this by re-training the BIB-AE architecture with latent space sizes between 2 and 512 for fixed β_KLD = 0.05. The number of additional Standard Normal sampled variables is adjusted such that the total number of 512 decoded variables stays constant.
In Fig. 2 (left) we have sorted the trainable latent space variables by their individual KLD calculated via Eq. 3. On the vertical axis, the total information (i.e. the sum of KLD values up to and including latent variable i) is shown. Indeed, we observe increasing total encoded information with increasing latent space size until a saturation at approximately 45 nats (≈ 64 bits) is reached around a latent space size of 64. After that, a larger latent space does not substantially increase the learned information.
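The saturation curve in Fig. 2 (left) can be reproduced from the per-variable KLD values alone. A small sketch (a hypothetical helper, assuming the KLD values of Eq. 3 are already available as an array):

```python
import numpy as np

def cumulative_information(kld_per_variable):
    """Sort the per-variable KLD values in descending order and return the
    running total, i.e. the total encoded information up to and including
    the i-th most informative latent variable."""
    kld_sorted = np.sort(np.asarray(kld_per_variable))[::-1]
    return np.cumsum(kld_sorted)

# The last entry is the total information content of the latent space:
curve = cumulative_information([0.1, 2.0, 0.5])  # total ~ 2.6 nats
```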
Next, we consider the KLD per latent space variable in Fig. 2 (right). All models follow a similar pattern: only a few variables contain a high amount of information (high KLD). In particular, there are always two variables that encode significantly more information than the remaining ones. Furthermore, depending on the latent space size, about 60 variables contain > 0.3 nats of information. Naively, we would expect the most efficient use of the latent space at a size of n = 64 to yield optimal performance. Evaluating the performance of generative models is not straightforward and constitutes an active topic of research. Methods such as the Inception Score [24] were proposed to evaluate models which produce photographic images. However, such scores are typically domain-specific and cannot directly be applied to our dataset. We therefore define a problem-specific fidelity score S_JSD that summarizes the performance across a number of physically relevant observables. The score is calculated by combining the Jensen-Shannon divergences of the distributions shown in Fig. 6, namely the visible cell energy, the total shower energy, the occupancy, the center of gravity in z, as well as the radial and longitudinal energy distributions. These are some of the most relevant global differential distributions for photon shower analysis and were previously used to judge model performance [14]. Additional details on how the score is calculated are given in Appendix A.

Table 1. Fidelity score S_JSD for the best epoch of multiple model configurations with different latent space sizes n. For n = 24 the best score out of multiple training runs is given, while the mean score for those trainings is S_JSD,24 = 1.02 ± 0.12. Only one training each was performed for sizes n ≠ 24.
In Table 1 we show the fidelity score for different values of the latent space size n. Lower values correspond to better agreement with the underlying slow simulation. For very small n, the performance increases with n until the best observed value at n = 24. Seven trainings with identical network setup but different random weight initializations were performed for this point to obtain an estimate of the associated uncertainty (calculated as the standard deviation of the individual results at n = 24). All further hyperparameters were kept the same as in Ref. [14]. The table lists the best score out of those trainings. Since each model requires several days of training, the limited computing resources did not allow for a broader uncertainty estimation of the fidelity scores. For larger n the performance is approximately stable within the uncertainty observed for n = 24. This implies that the maximum information content of ≈ 45 nats encoded in the latent space is not needed for optimal generative performance.
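A combined fidelity score of this kind can be approximated from histogrammed observables. The sketch below sums Jensen-Shannon distances over a dictionary of observables; the exact combination used for S_JSD (binning, weighting, summation) is described in Appendix A, so this helper is an illustrative assumption rather than the paper's exact definition:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def fidelity_score(real_obs, gen_obs, n_bins=50):
    """Sum the Jensen-Shannon distances between the binned distributions of
    each observable. `real_obs` and `gen_obs` map observable names (e.g.
    'e_vis', 'cog_z', 'n_hits') to 1d arrays of per-shower values."""
    score = 0.0
    for name in real_obs:
        lo = min(real_obs[name].min(), gen_obs[name].min())
        hi = max(real_obs[name].max(), gen_obs[name].max())
        bins = np.linspace(lo, hi, n_bins + 1)
        p, _ = np.histogram(real_obs[name], bins=bins)
        q, _ = np.histogram(gen_obs[name], bins=bins)
        score += jensenshannon(p, q)  # jensenshannon normalizes the counts
    return score
```

A perfect generator yields a score of 0; larger values indicate a poorer match with the Geant4 reference.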

Correlations between latent space and physics
As only a few variables seem to encode most of the shower information, we investigate what kind of physics information is learned by these variables. In Fig. 3 the Pearson correlation coefficients ρ between different shower physics distributions and the distributions of the sampled z_i for the five highest-KLD latent variables, as well as the incident particle energy (which is used for conditioning and is included as a latent variable in the BIB-AE), are shown for four different model configurations. In this case the sampled z_i values are obtained from the encoded latent space via N(µ_i, σ_i²), not from a Standard Normal distribution N(0, 1). The physics distributions include the first and second moment in each spatial dimension² (the first moment corresponds to the center of gravity), the visible energy, the incident particle energy, the number of hits, and the fraction of deposited energy in each third of the calorimeter (in the z-direction).
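The shower moments used here can be computed directly from the 3d calorimeter images. A minimal sketch (assuming a (30, 30, 30) energy array indexed as [x, y, z]; the function name is illustrative):

```python
import numpy as np

def shower_moments_z(shower):
    """First and second central moments of the energy deposition along z.
    The first moment is the center of gravity in z (CoG-Z), measured here
    in units of the layer index."""
    e_z = shower.sum(axis=(0, 1))                     # longitudinal profile
    z = np.arange(e_z.size)
    cog_z = np.average(z, weights=e_z)                # first moment
    m2_z = np.average((z - cog_z) ** 2, weights=e_z)  # second central moment
    return cog_z, m2_z
```

The same pattern with `axis=(1, 2)` and `axis=(0, 2)` sums yields the moments in x and y.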
Regardless of the model configuration, it is apparent that the highest-KLD latent variable strongly correlates (ρ ≈ 0.9) with the center of gravity along the shower direction z (and in turn with the fraction of deposited energy in the first and last third of the calorimeter). Another variable is correlated (ρ ≈ 0.5) with the second moment in z (and with the energy fraction in the middle of the calorimeter). It appears that, of all the shower variables, the center of gravity in z (CoG-Z) of each shower is encoded into these two latent variables. This is notable because the CoG-Z is only moderately correlated (ρ ≈ 0.4) with the incident particle energy on which the BIB-AE is conditioned. Hence the BIB-AE learns the CoG-Z of each shower and uses it in the decoding step for reconstruction. Interestingly, this pattern is very stable over multiple independent training runs and even different latent space sizes n.

² The incident photon enters the calorimeter in the center of the x-y plane at z = 0 and traverses along the z-axis.

We can use this observation to improve the CoG-Z distribution in the generated events (see Fig. 6 (bottom left)). This distribution was previously not particularly well-modeled since the generation did not take this latent space correlation into account. In addition, targeted sampling of a subspace of these latent variables allows us to generate showers with a specific shower start. This is visualized in Fig. 4 with multiple 3d images of a decoded calorimeter shower in which only the highest-KLD latent variable z was altered. This change leads to a different shower start and hence to an altered center of gravity along the z-axis. Figure 5 visualizes the energy deposition per layer in the z-direction of these five decoded shower images.
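A correlation matrix like the one in Fig. 3 can be built with plain Pearson coefficients. A sketch under the assumption that the sampled latent codes and per-shower observables are available as arrays (names are illustrative):

```python
import numpy as np

def latent_physics_correlations(z, observables):
    """Pearson correlation rho between every sampled latent variable and
    every physics observable. `z` has shape (n_showers, n_latent);
    `observables` maps names (e.g. 'cog_z') to arrays of shape (n_showers,)."""
    return {name: np.array([np.corrcoef(z[:, i], obs)[0, 1]
                            for i in range(z.shape[1])])
            for name, obs in observables.items()}
```

Ranking the latent variables by their individual KLD and inspecting the rows of such a matrix is how the CoG-Z encoding described above can be identified.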

Improving generative performance
Understanding the encoded shower information in the latent space, particularly the center of gravity, helps us make educated optimization choices for improving model performance. Specifically, we can increase generation fidelity either by regularizing the latent space more strongly or by leveraging and sampling from the information-rich non-Gaussian distributions. Either optimization path can be approached in different ways; we have chosen one exemplary method for each: (1) increasing β_KLD reduces the overall KLD in the latent space, yielding latent distributions more strongly regularized towards Standard Normal distributions and therefore more accurate generative sampling from such an N(0, 1) distribution; or (2) keeping the already trained model but fitting a second density estimator, such as Kernel Density Estimation (KDE) [25], to the latent variables and sampling directly from the encoded latent space. The former approach is motivated by Ref. [26], while the latter mirrors a method for the Buffer-VAE from Ref. [27].

Adjusting the Kullback-Leibler divergence
Our baseline model uses a latent KLD weight of β_KLD = 0.05. A higher value of β_KLD enforces a lower KLD, so less information is encoded in the latent space. The latent space therefore more closely approaches a Standard Normal distribution, and sampling from N(0, 1) in the generation step should yield showers resembling the Geant4 truth more closely. As shown in Fig. 6 (bottom left), this indeed improves the CoG-Z distribution compared to the baseline. However, there is a trade-off for other distributions, such as the total energy sum (top center) and the number of hits (top right), which become narrower than the baseline and truth distributions. This can also be seen in Fig. 7: except for low energies, the energy linearity is better, but the relative width of the energy distributions is on average narrower than for the baseline model. Figure 8 illustrates that for the highest (left) and second-highest (right) KLD latent variables, the sampled z distributions for β_KLD = 0.4 are very similar to Normal distributions, while they deviate significantly for the baseline value of β_KLD = 0.05. Although the CoG-Z distribution improves, the overall fidelity score given in Table 2 is slightly worse for β_KLD = 0.4.

Sampling from a Kernel Density Estimate
Another way to improve the generative performance, particularly the CoG-Z distribution, is to utilize the latent variables highly correlated with the CoG-Z distribution. Using exactly the same model as in Ref. [14] (β_KLD = 0.05, n = 24) without retraining, one can see in Fig. 8 that the encoded distribution deviates from a Standard Normal distribution. In the usual VAE-like setup one would nevertheless sample these variables from N(0, 1) to generate new samples, thereby ignoring the correlations between the latent space and the shower physics. Instead, one can sample those latent variables from the distribution of the sampled z_i values, which are drawn from the encoded N(µ_i, σ_i²) distributions. Since at least two variables as well as the incident energy are correlated with the CoG-Z distribution, one needs to account for correlations between latent variables when sampling. This can be done by encoding a sufficiently large number of showers (e.g. 500k) from the training set, applying a density estimation method such as KDE, and then sampling new latent variables from it. In the BIB-AE case with 24 encoded latent variables plus energy conditioning, this amounts to training a KDE in a 25-dimensional space. The resulting KDE kernel can be used as a probability density function for sampling the latent z variables for improved shower generation. As shown in Fig. 6, this KDE sampling approach yields global differential distributions very similar to the Geant4 truth, with superior results for the CoG-Z and number-of-hits distributions in comparison to the other two models. The linearity in Fig. 7 closely resembles the baseline model; however, the relative width of the energy distributions is on average slightly overestimated. The fidelity score in Table 2 is the best of all tested model configurations. The score for the β_KLD = 0.05 model was chosen as the best out of seven training runs, and the same model was used to simply add the KDE sampling step.
This illustrates another benefit of the KDE approach: It can be applied to any already trained VAE-like model without expensive re-training.
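Under the assumption that scipy's `gaussian_kde` stands in for the density estimator, the KDE sampling step can be sketched as follows (the encoder call producing `z_encoded` is hypothetical):

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_latent_kde(z_encoded):
    """Fit a KDE to encoded latent codes. `z_encoded` has shape
    (n_showers, n_dims); for the BIB-AE, n_dims = 24 latent variables
    plus the energy condition, i.e. 25."""
    return gaussian_kde(z_encoded.T)  # scipy expects (n_dims, n_showers)

def sample_latent(kde, n):
    """Draw n new latent vectors of shape (n, n_dims) for the decoder,
    replacing the usual draws from N(0, 1)."""
    return kde.resample(n).T
```

Because the KDE is fitted jointly in all dimensions, the correlations between the latent variables (and with the incident energy) are preserved in the new samples.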

Summary & conclusions
Improving the simulation of calorimeter showers with generative models is an active topic of research, motivated by the large resource consumption of these simulation tasks. As such generative models still require substantial training efforts, precluding large hyperparameter scans for optimization, we investigate how a better understanding of the latent space can be used to increase performance. While a BIB-AE architecture was used for these studies, the developed strategies should readily transfer to other generative models with an encoded latent space (i.e. VAE-like, but not GAN-like, architectures).
We first quantify the information encoded in the latent space and note that for a fixed value of β_KLD = 0.05, it saturates at ≈ 45 nats. However, the generative performance, as measured by a metric defined to take the relevant physical distributions into account, achieves its best value at a latent space size of n = 24 with ≈ 28 nats. Put differently, more information encoded in the latent space does not necessarily translate into better generative performance.
This observation offers an interesting parallel to the information bottleneck principle [13,16]. It proposes that for a supervised classification task, the latent space Z should maximise its mutual information I with the true class labels C but minimise information irrelevant for classification between data examples X and latent space:

L_S = min_φ ( I(X; Z) − β · I(Z; C) ).

Here L_S is the supervised optimisation target, we minimise over parametric mappings φ from data to latent space, and the Lagrange multiplier β denotes the trade-off between the two goals. For unsupervised tasks, no class labels are available, and the problem becomes

L_U = min_φ ( I(X; Z) − β · I(Z; X) ),

which is also the core of the BIB-AE loss formulation [13]. This is a much more challenging compression problem, as the entropy of a small number of class labels will in general be much smaller, and therefore easier to encode, than the entropy of the data distribution. We observed that without additional constraints, such as restricting the latent space size n, more information than needed for good generative performance is encoded in the latent space, suggesting the need for additional regularising constraints. An interesting open question for future research is therefore how the useful encoded information might be quantified.

Regardless of the model configuration, only a few latent variables of the BIB-AE contain most of the shower information. Correlating the latent variables with various shower physics metrics reveals that the center of gravity in the z-direction is always encoded into the two highest-KLD latent variables. This encoding can be leveraged for targeted generation of photon showers with a specific shower start by sampling from a subspace of the highest-KLD variable.
Furthermore, these observations can help improve the generative fidelity of the BIB-AE model. This can be achieved either by lowering the encoded KLD or by sampling directly from the encoded latent space density, e.g. learned via Kernel Density Estimation. Forcing the latent distributions closer to a unit Normal naturally improves the physical observables most strongly correlated with the highest-KLD latent variables, but decreases the performance of others. The latter approach yields the best results, with the additional benefit of being applicable to the already trained BIB-AE model (or any other VAE-like model).
The increasing use of generative machine learning models motivates a closer look into their learned encoding. Especially in particle physics, the precision required for many differential distributions spanning several orders of magnitude offers a rich laboratory to study the connection between generation fidelity and latent space. On the one hand, this offers several methods to probe and improve generative performance, for example by identifying poorly modeled distributions for which a discrepancy between the encoded-into and sampled-from latent space exists; resolving this discrepancy yields better-generated showers. On the other hand, the observed difference between maximum-information and best-performance latent space capacity raises an interesting problem for future studies.