Baler - Machine Learning Based Compression of Scientific Data

Storing and sharing increasingly large datasets is a challenge across scientific research and industry. In this paper, we document the development and applications of Baler - a Machine Learning based data compression tool for use across scientific disciplines and industry. Here, we present Baler’s performance for the compression of High Energy Physics (HEP) data, as well as its application to Computational Fluid Dynamics (CFD) toy data as a proof-of-principle. We also present suggestions for cross-disciplinary guidelines to enable feasibility studies for machine learning based compression of scientific data.


Introduction
Many different fields of science share a common issue: storing ever-growing datasets. By the end of the next decade, the Large Hadron Collider (LHC) experiments will have over an order of magnitude more data to analyze than they do currently [1-3]; the Square Kilometre Array (SKA) experiment is expected to record 8.5 EB of data over its 15-year lifespan [4]; and fields such as Computational Fluid Dynamics (CFD) rely on TB-sized simulation samples that need to be stored and shared. Without significant R&D, the datasets expected to be collected by big-data science experiments are projected to exceed the available storage resources (see e.g. Fig. 2 of Ref. [1] for the case of the ATLAS experiment at the LHC). This cross-disciplinary issue is not limited to scientific research and extends to industrial operations [5].

Lossy data compression in High Energy Physics
A common mitigation strategy for this problem is to compress data using lossless algorithms, see e.g. Refs. [6][7][8]. Once the storage limit is reached, one is forced to discard parts of the dataset, or to save only certain features of the data. Generally, this can be done without impacting the overall scientific program of the experiments, for example by using a data selection system called a trigger, which only stores data satisfying certain pre-determined characteristics that ensure the dataset is aligned with the experiment's main scientific goals. However, saving only a subset of data is not ideal for processes where additional statistical power is necessary, e.g. for rare signals buried in high-rate backgrounds. In these cases, one can consider using lossy compression algorithms, which reduce the data size, ideally beyond what lossless compression algorithms can achieve [9], by using approximation and partial data discarding at the expense of data fidelity. One limitation of lossy compression is that, to obtain high compression ratios with low data loss, the compression algorithm must be tailored to the input data; MP3 [10], for instance, is a lossy compression algorithm tailored to audio waves. Consequently, a general solution to this cross-disciplinary problem is hard to obtain. To address this, we present Baler: a lossy data compression tool based on the machine learning autoencoder architecture, which tailors the compression to the user's dataset. It is also important to note that for such a tool to be usable in a scientific experiment, the loss in data quality must be controlled, and it must be tolerable or negligible with respect to other sources of experimental uncertainty.

Figure 1: Illustration of an autoencoder consisting of an input and output layer. In between the input and output, there are hidden layers and a latent space. For compression, the dimensionality of the latent space is less than that of the input and output layers. Modified from Ref. [12].

Autoencoders for lossy data compression
Autoencoders (AEs) [11] are a class of unsupervised, deep neural networks characterized by an encoder, a central latent space, a decoder, and a target space of the same dimensionality as the input space, as illustrated in Figure 1. The encoder is a neural network that maps each input, $x$, to an abstract latent point, $z$, generally of lower dimensionality than the input. The decoder then maps the latent point back to the same dimensions as the input to give the reconstructed output, $\hat{x}$. AEs can therefore be trained to reconstruct the various features of the input data, while their bottleneck structure prevents them from simply learning the identity map. The dimensionality of the latent space is of particular importance, as it determines the amount of compression achieved, with the latent points being the compressed data and the decoder acting as the decompression algorithm.
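As a rough illustration of how the latent dimensionality sets the amount of compression, the sketch below assumes that inputs and latent points are stored with the same numeric precision and that only the latent points are written to disk. Under that assumption the ratio is simply the input width divided by the latent width. The function name and the example widths are hypothetical and are not taken from Baler.

```python
def ideal_compression_ratio(n_input: int, n_latent: int) -> float:
    """Ratio of stored values before and after encoding.

    Assumes equal bytes per value on both sides and ignores the decoder
    and any other files that must be stored alongside the latent points.
    """
    if n_input <= 0 or n_latent <= 0:
        raise ValueError("dimensions must be positive")
    if n_latent > n_input:
        raise ValueError("latent space wider than the input gives no compression")
    return n_input / n_latent


# Hypothetical example: 24 input variables encoded into 4 latent variables.
print(ideal_compression_ratio(24, 4))
```

In practice the achieved ratio can differ from this idealized value, since the stored decoder and any additional encoding of the latent points also affect the size on disk.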
AE based compression of scientific data has shown promising results for multiple fields of study such as meteorology, cosmology, computational fluid dynamics, crystallography, etc. [13][14][15][16][17][18][19]. The use of AEs for data compression in High Energy Physics (HEP) has also shown promising results in previous studies [20][21][22][23]. A number of these studies focus on the compression of objects directly as the data is taken (online compression), which requires training a model on a dataset and using it to compress a different dataset with the same input characteristics. Offline compression, on the other hand, corresponds to the case where the model is trained to compress a dataset and is used to compress that dataset only. In this work, we deal with offline compression as a stepping stone toward online compression and leave the latter for future studies.

Baler methodology
AE based compression workflows generally consist of data pre-processing, model architecture selection, model training, compression and decompression using a trained model, and performance evaluation via selected metrics. Baler is an intuitively packaged and modular tool that allows for easy modifications in any component of the workflow. Baler is available in its open-source software repository [24]. In this section, we discuss the setup for HEP data compression as a working example.

HEP model design
The benchmark HEP AE is built with 3 fully connected neural network layers in both the encoder and decoder. The encoder layers have 200, 100, and 50 nodes respectively, while the decoder has a symmetrically inverted layer structure. We train our models by minimizing the loss function

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{reco}} + \beta\,\mathcal{L}_{1},$$

where $\beta$ is a free hyperparameter that controls the relative contribution of the two terms to the net loss, $\mathcal{L}_{\mathrm{reco}}$ is a suitable reconstruction error metric, and $\mathcal{L}_{1}$ is an $L_1$-type regularization term to enforce sparsity in the AE weights. In this work we choose $\mathcal{L}_{\mathrm{reco}}$ to be the mean squared error (MSE) summed over each mini-batch, defined as

$$\mathcal{L}_{\mathrm{reco}} = \sum_{i=1}^{m} \frac{1}{n} \sum_{j=1}^{n} \left(x_{ij} - \hat{x}_{ij}\right)^{2}.$$

Here, $m$ is the batch size and $n$ is the number of variables in each data entry. $X$ is the vector of batched model inputs and $\hat{X}$ is the vector of corresponding model reconstructions, where $x \in X$ and $\hat{x} \in \hat{X}$. $\mathcal{L}_{1}$ introduces sparsity to the model and reduces its storage size, thereby reducing the overheads discussed in Sec. 2.2. $\mathcal{L}_{1}$ is defined as

$$\mathcal{L}_{1} = \sum_{k} \left|w_{k}\right|,$$

where the sum runs over the AE weights $w_{k}$, and has previously been shown to perform better with regards to HEP data compression [25].
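The structure of this loss can be sketched in a few lines of dependency-free Python. This is an illustration only, not Baler's PyTorch implementation: it reads "MSE summed over each mini-batch" as the sum of per-entry mean squared errors, takes the regularization term to be the sum of absolute weight values, and combines them as reconstruction term plus beta times regularization term. The normalization, the function names, and the toy values are all assumptions.

```python
def mse_reco_loss(batch, reco):
    """Per-entry mean squared error, summed over the entries of a mini-batch.

    batch and reco are lists of m entries, each a list of n floats.
    The exact normalization is an assumption made for this sketch.
    """
    total = 0.0
    for x, x_hat in zip(batch, reco):
        n = len(x)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat)) / n
    return total


def l1_penalty(weights):
    """Sum of absolute values over a flat list of model weights."""
    return sum(abs(w) for w in weights)


def total_loss(batch, reco, weights, beta):
    """Reconstruction term plus beta-weighted sparsity term."""
    return mse_reco_loss(batch, reco) + beta * l1_penalty(weights)


# Hypothetical toy values: two entries with two variables each.
batch = [[1.0, 2.0], [3.0, 4.0]]
reco = [[1.0, 2.5], [2.0, 4.0]]
weights = [0.5, -0.25, 0.0]
print(total_loss(batch, reco, weights, beta=0.1))
```

Setting beta to zero recovers the pure reconstruction loss, which makes it easy to check how strongly the sparsity term influences a given training run.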

HEP training setup and evaluation metrics
The models are built using the PyTorch [26] framework and optimized using the Adam optimizer [27]. A learning rate of $10^{-3}$ is used in combination with a learning rate scheduler, namely the ReduceLROnPlateau method from PyTorch. The scheduler uses a patience of 50 epochs, a reduction factor of 0.5, and a minimum learning rate of $10^{-6}$. We train for 1000 epochs with a batch size of 512 and an early stopping strategy with a patience of 100 epochs. $\mathcal{L}_{\mathrm{total}}$ converges to the order of $10^{-5}$. We consider the mean and RMS of the residual and response as our evaluation metrics, where

$$\mathrm{residual} = \hat{x}_{i} - x_{i}, \qquad \mathrm{response} = \frac{\hat{x}_{i} - x_{i}}{x_{i}},$$

with $x_{i}$ being the original data and $\hat{x}_{i}$ being the data reconstructed from the compressed file. Another important metric for compression is the compression ratio, defined as

$$R = \frac{\text{size(original file)}}{\text{size(compressed file)}}.$$

However, unlike some traditional compression algorithms, AE compression requires auxiliary files to be saved. Auxiliary files mainly include the decoder weights and biases along with the corresponding PyTorch metadata required to load the model when decompressing. This information is saved using the save functionality provided by PyTorch. Auxiliary files can also include auxiliary data such as normalization features and data headers. Taking this into account, the actual compression ratio is

$$R_{\mathrm{actual}} = \frac{\text{size(original file)}}{\text{size(compressed file)} + \text{size(auxiliary files)}}.$$

To avoid the results being skewed by a single specific seed being used in the training, the training was done using 10 different seeds, and the performance evaluation was done on the 5 best performing seeds. For the offline compression case studied here, choosing the best seed is considered a type of hyperparameter optimization. The impact of different seeds will be studied in the future.
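The evaluation metrics can be illustrated with a short, dependency-free sketch. It assumes the sign convention "reconstructed minus original" for the residual, divides by the original value for the response, and takes RMS to mean the square root of the mean of squares. These conventions, the function names, and the toy values are assumptions made for illustration and may differ from Baler's own code.

```python
import math


def residuals(original, reconstructed):
    """Element-wise difference: reconstructed minus original."""
    return [x_hat - x for x, x_hat in zip(original, reconstructed)]


def responses(original, reconstructed):
    """Relative difference; entries whose original value is zero are skipped."""
    return [(x_hat - x) / x for x, x_hat in zip(original, reconstructed) if x != 0]


def mean_and_rms(values):
    """Mean and root mean square of a non-empty list of floats."""
    if not values:
        raise ValueError("need at least one value")
    mean = sum(values) / len(values)
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return mean, rms


# Hypothetical toy values for a single variable.
original = [10.0, 20.0, 40.0]
reconstructed = [11.0, 19.0, 40.0]
print(mean_and_rms(residuals(original, reconstructed)))
print(mean_and_rms(responses(original, reconstructed)))
```

A response distribution centred on zero with a small RMS indicates that the reconstruction neither biases nor strongly smears the variable in question.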

Baler input data
Baler supports NumPy [28] arrays as input and output. This format was chosen because NumPy arrays are an easy-to-handle data format that is already widely used across various scientific disciplines. In addition, the PyTorch [26] library at the core of Baler operates on tensors, so the user's original file format must be converted, and this conversion is simple with NumPy arrays. In this initial study, we focus on HEP data and touch on preliminary studies using data from CFD.

HEP input data
Processes involving the strong force dominate proton-proton collision interactions at the LHC. Therefore, one of the most commonly occurring observable objects at experiments like ATLAS and CMS are the collimated showers of particles resulting from these strong processes, reconstructed into jets [29]. To showcase Baler's performance on HEP data, we use a subset of the jet data recorded by the CMS experiment at the LHC in 2012, released as open data under the Creative Commons CC0 waiver [30]. In this dataset, each entry is a proton-proton collision event, and each event can contain multiple jets. Each collision event is independent of other events and there is no time-dependency in this data. In the data, jets are represented as 4-vectors $(p_T, \eta, \phi, m)$, where $p_T$ is the momentum of the jet perpendicular to the direction of the colliding proton beams, $\eta$ is a quantity related to the angle between the jet momentum and the beam, $\phi$ is the azimuthal angle measured around the beam axis, and $m$ is the mass of the jet. Collectively, these variables are called the jet's four-momentum, and they are the most relevant variables for LHC measurements and analyses involving jets. Each jet has several other associated variables; for example, "jet area" is a measure of the footprint size of the jet. The full list of variables and further information about the content of the dataset we use for testing Baler can be found in Ref. [31]. At this stage, it is not clear whether it would be recommended to use a lossy compression algorithm on the four-momenta, but we include them in the benchmarking of the algorithm for this initial study.

HEP data pre-processing
To simplify the use of HEP data in Baler, the data is pre-processed. First, the data is flattened, as the original hierarchical data structures of the input data are not supported by current machine learning frameworks such as PyTorch. As a result, each jet in the dataset is treated as independent of the others, and correlations within events are lost. Second, features of the data which are non-numerical are dropped. Both of these pre-processing steps are limitations in the applicability of this compression method that can be overcome at a later stage. The pre-processing removes nine variables containing only zeroes and further truncates 15% of the data, yielding a final dataset with a size of 116.9 MB consisting of 24 variables and 608,978 entries.
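The flattening and column-dropping steps can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions: events are represented as dictionaries holding a list of jet dictionaries, every jet carries the same fields, and the field names are invented for the example. It does not reproduce Baler's actual pre-processing code, and the 15% truncation step is omitted because its criterion is not specified here.

```python
from numbers import Real


def flatten_events(events):
    """Turn a list of events, each holding a list of jets, into one row per jet."""
    return [jet for event in events for jet in event["jets"]]


def numeric_nonzero_columns(rows):
    """Names of columns that are numeric in every row and not identically zero."""
    keep = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        is_numeric = all(
            isinstance(v, Real) and not isinstance(v, bool) for v in values
        )
        if is_numeric and any(v != 0 for v in values):
            keep.append(name)
    return keep


def preprocess(events):
    """Return the kept column names and a flat table with one row per jet."""
    rows = flatten_events(events)
    if not rows:
        return [], []
    columns = numeric_nonzero_columns(rows)
    table = [[float(row[c]) for c in columns] for row in rows]
    return columns, table


# Hypothetical toy events; the field names are illustrative only.
events = [
    {"jets": [{"pt": 50.2, "eta": 0.3, "label": "a", "unused": 0.0},
              {"pt": 31.7, "eta": -1.1, "label": "b", "unused": 0.0}]},
    {"jets": [{"pt": 78.9, "eta": 2.0, "label": "c", "unused": 0.0}]},
]
columns, table = preprocess(events)
print(columns)
print(len(table))
```

Note that after flattening, the table no longer records which jets came from the same event, which is exactly the loss of within-event correlations described above.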

Baler performance on HEP data
As described in Section 2.2, we perform multiple training runs on the same dataset with different random seeds to account for statistical variations introduced by the seeds. Baler's performance on a given dataset is visualized and evaluated by looking at the variable distributions together with the distribution of responses and/or residuals. In Figure 2 we show, for one seed, the distribution of four selected variables before and after compression using R = 1.7, together with the response distribution for each variable. The mean and root mean square (RMS) of the response distributions are presented in the figure. In Refs. [33,34], Baler's performance on the remaining 20 variables is presented at both R = 1.7 and R = 6.
The difference between $R$ and $R_{\mathrm{actual}}$ for HEP data is negligible. On disk, the total auxiliary file size reaches at the very most 550 KB. This means that for our case, where the input file size is 116.9 MB, we obtain $R = 1.7 \rightarrow R_{\mathrm{actual}} \approx 1.69$. As the auxiliary file sizes for HEP do not increase with the number of entries, they become increasingly negligible for larger datasets.
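The relation between the nominal and actual compression ratios can be checked with a few lines of arithmetic. The sketch below takes the figures quoted above (116.9 MB input, R = 1.7, at most 550 KB of auxiliary files, treated here as 0.55 MB) and then, as a purely hypothetical extrapolation, repeats the calculation for inputs ten and one hundred times larger while keeping the auxiliary size fixed. The function names are illustrative.

```python
def actual_compression_ratio(original_size, compressed_size, auxiliary_size):
    """Original size divided by everything that must be stored to decompress."""
    return original_size / (compressed_size + auxiliary_size)


def actual_ratio_from_nominal(original_size, nominal_ratio, auxiliary_size):
    """Actual ratio when the compressed file alone achieves nominal_ratio."""
    compressed_size = original_size / nominal_ratio
    return actual_compression_ratio(original_size, compressed_size, auxiliary_size)


AUX_MB = 0.55        # upper bound on auxiliary files quoted in the text
NOMINAL_RATIO = 1.7  # nominal compression ratio quoted in the text

# The first size is the dataset used here; the larger ones are hypothetical.
for original_mb in (116.9, 1169.0, 11690.0):
    r_actual = actual_ratio_from_nominal(original_mb, NOMINAL_RATIO, AUX_MB)
    print(f"{original_mb:>9.1f} MB -> R_actual = {r_actual:.4f}")
```

Because the auxiliary size stays constant while the compressed file grows with the input, the actual ratio approaches the nominal one as the dataset gets larger.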

Baler application in other scientific fields
Since a major goal of Baler is to investigate the feasibility of AE compression in different fields of science, Baler was also tested on simulated toy data from CFD. The simulated dataset used for this test was the x-component of velocity for air flowing over a cube mounted to a wall. For simplicity, we only considered one slice in 3D space, which makes it straightforward to compress the resulting 2D data with a convolutional-AE model, where the encoder and decoder are Convolutional Neural Networks [35]. Figures 3a and 3b show the 2D data before and after compression and decompression, with R = 88. Figure 3c shows the difference between the two, which is on a scale four orders of magnitude smaller. These results suggest that Baler is applicable to multiple scientific disciplines.

Conclusions and Outlook
In this work, we motivate the need for effective data compression strategies as a solution to the growing storage issues related to large data volumes across many disciplines of scientific research. We present Baler as a modular solution to leverage machine learning based lossy data compression. We identify and define two major use cases of the tool, namely online and offline data compression. We evaluate performance for offline compression of HEP and CFD data as a proof-of-concept to demonstrate Baler's flexibility. The auxiliary file size produced and the achievable compression ratio are dataset dependent. We note that for the specific HEP dataset used, gzip outperforms Baler [33,34]. However, for the CFD case we observe that Baler outperforms gzip in terms of compression ratio [33], with the added tradeoff of increased auxiliary file size. With a future implementation of online compression, the auxiliary file size will be of less significance. Near-future extensions to this work include assessing performance variations related to dataset sizes and support for error-bounded compression. Though we provide guidelines for using Baler to perform feasibility studies for a given dataset, there is currently no method to quickly project the likelihood of a dataset being suitable for compression with Baler. A potential method is to calculate a coefficient of variation for a given dataset, as described in [36], and use this as a likelihood metric. This implementation, along with studies on other potential solutions for this problem, is marked as a feature for future releases of Baler.
To deal with variations across dataset sizes we plan to perform follow-up studies exploring different HEP datasets and input representations that may involve low-level detector data.
Another related limitation is compressing files larger than the available RAM; to address it, we plan to test industry-standard approaches such as optimal caching of objects in memory. These solutions are viable for offline compression since there are no associated latency or resource constraints in this case.
However, online compression is a major area of study we intend to investigate, given its high potential in large HEP experiments that generate data at very high rates. To tackle the problem of online compression, we would need better generalization capabilities within the machine learning models. A potential way to achieve this with unsupervised learning is to use probabilistic generative models such as variational autoencoders and normalizing flows.

Figure 2: Distributions of four selected jet variables. Alongside each distribution is a histogram of the response for that variable after compression with R = 1.7.

Figure 3: A Computational Fluid Dynamics toy simulation showing the x-component of air velocity before compression with R = 88 (a), after decompression (b), and the difference between the two (c).