Invertible Neural Networks in Astrophysics

Modern machine learning techniques have become indispensable in many fields of astronomy and astrophysics. Here we introduce a specific class of methods, invertible neural networks, and discuss two specific applications: the prediction of stellar parameters from photometric observations, and the study of stellar feedback processes from emission lines.


Introduction
Astronomy and astrophysics have always been highly data-intensive sciences. Large astronomical survey projects, conducted at modern Earth-bound or spaceborne observatories, or massively parallel astrophysical simulations, run at national and European supercomputing facilities, lie at the very forefront of the current "big data" deluge.
Whereas in the past, researchers were able to keep up with the amount of data pouring in by employing manual control and analysis methods, it became clear in recent years that this is no longer feasible, and that more automated and self-guided approaches are needed. Modern large-scale observational surveys or numerical simulation projects, such as those mentioned above, would not be possible without highly automated data-reduction pipelines that convert the raw data from the telescope or the supercomputer into a dimensionally reduced and more science-ready form. Only this reduction enables an efficient analysis and astrophysical interpretation of the wealth of information available. Key concepts of artificial intelligence, driven by the ever increasing capabilities of modern machine learning techniques, are currently becoming a focal point of these developments. New neural network designs and supervised or unsupervised learning schemes allow for a comprehensive analysis of complex multi-scale astrophysical data with unprecedented accuracy and speed.
Here we introduce and discuss invertible neural networks (INNs) [1-3, 30], which have been successfully applied in the astronomical and astrophysical context to the analysis of star clusters [31], planetary systems [21], stellar feedback processes [25], and galaxy mergers [14] in recent proof-of-concept studies.

Machine Learning in Astronomy and Astrophysics
Machine learning employs statistical models to predict the characteristics of a dataset using samples of previously collected data, without relying on physical models of the system. The introduction of machine learning for solving regression, classification and clustering problems has revolutionized scientific research, and in particular has provided effective methods for analyzing big astronomical data [17,24]. In order to construct a model from observed data, many methods rely on human-defined classifiers or 'feature extractors' [22]. However, complex problems require algorithms that automate feature extraction by learning from large amounts of data. Such self-learned feature extraction algorithms are an integral part of the deep learning family, which is based on the construction of artificial neural networks (NNs) [20]. While training NNs requires significant computational power, they achieve far higher levels of accuracy than classic machine learning for many non-linear problems.
There have been several recent studies that employ NN approaches to solve prediction tasks in astronomy and astrophysics. In the context of star cluster research, similar to the focus here, classical convolutional NNs have been employed to study stellar properties either from spectral [15,41] or photometric [33,44,46] data, or they have been trained on data from the European astrometric Gaia satellite to predict properties of stellar clusters in the Milky Way [10,28]. Classical NN methods have also been used for analyzing and classifying galaxy properties, either based on training data from large observational surveys [45] or from numerical simulations of cosmic structure formation [23,47]. Other studies in this context have focused on determining the properties of the underlying dark-matter halos [12,43] and on identifying merging galaxies or merger remnants from images [7,8,11,19].

Invertible Neural Networks
Invertible neural networks (INNs) are a specific type of NN architecture based on the concept of normalizing flows [27]. They have been introduced to address complex and highly ambiguous inverse problems [1]. Unlike classical neural networks, which solve the inverse problem directly, INNs learn the forward process and use additional latent output variables to capture the information otherwise lost. Leveraging their invertible architecture, INNs then derive a solution for the inverse process without additional cost. Conditioned on the observations and the latent variable distribution, INNs can predict full posterior distributions, which is highly advantageous when studying multi-modal or degenerate problems, or when investigating complex correlations between parameters.
The advantage of invertible architectures is that the network automatically learns the inverse process when it is trained to approximate a known forward process. When considering degenerate problems or when taking uncertainties into account, an information loss is unavoidable in the forward process, such that different sets of physical parameters x are mapped onto identical observations y. Consequently, a degenerate y cannot uniquely determine the corresponding x. By introducing latent variables z that capture the information lost during the forward process, we can ensure a bijective mapping that could not be achieved with x and y alone. The original INN architecture links x to a unique pair [y, z], defining a bijective forward mapping f(x) = [y, z] and an inverse mapping x = f⁻¹(y, z) = g(y, z). The forward process has to be deterministic, and there are certain requirements on the intrinsic dimensionalities of x and y. Zero padding is necessary if the dimension of x is smaller than the dimension of [y, z].
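The invertibility requirement can be made concrete with an affine coupling block, the standard building block of INNs. Below is a minimal numpy sketch (not the trained networks from the studies cited here): the input is split into two halves, and one half is transformed by scale and shift functions of the other, so the block can be inverted exactly without ever inverting the sub-networks themselves. The linear maps standing in for the sub-networks s and t are illustrative placeholders for small trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy affine coupling block: split the input, transform the second half
# with scale/shift functions of the first. s() and t() are here random
# linear maps standing in for small trained sub-networks.
D = 4
W_s = 0.1 * rng.standard_normal((D // 2, D // 2))
W_t = 0.1 * rng.standard_normal((D // 2, D // 2))

def coupling_forward(x):
    x1, x2 = x[: D // 2], x[D // 2 :]
    s, t = x1 @ W_s, x1 @ W_t          # sub-network outputs depend on x1 only
    y2 = x2 * np.exp(s) + t            # affine transform of the second half
    return np.concatenate([x1, y2])

def coupling_inverse(y):
    y1, y2 = y[: D // 2], y[D // 2 :]
    s, t = y1 @ W_s, y1 @ W_t          # same sub-networks, never inverted
    x2 = (y2 - t) * np.exp(-s)         # analytic inverse of the affine map
    return np.concatenate([y1, x2])

x = rng.standard_normal(D)
assert np.allclose(coupling_inverse(coupling_forward(x)), x)
```

Because the inverse only re-evaluates s and t in the forward direction, the sub-networks can be arbitrarily complex without affecting invertibility; full INNs stack many such blocks with permutations in between.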
The cINN architecture, as illustrated in Figure 1, avoids these problems [2,3,29,30,37]. It uses a different mapping scheme by treating the observations y in both the forward and the inverse process as a condition c: f(x; c = y) = z, x = g(z; c = y) [3]. This approach has the advantage that there are no assumptions or restrictions on the intrinsic dimensionalities of x and y. It has the additional advantage that for very high-dimensional or complex datasets y, we can include a feature extraction network in the conditioning block and fully integrate it in the training process [2,3]. This allows us to employ cINNs in image processing tasks [13,32].
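The conditioning idea can be sketched the same way: in a cINN the condition c (the observation y) is simply concatenated to the sub-network inputs, so invertibility in x is preserved while c can have arbitrary dimension. Again a minimal numpy illustration with placeholder linear sub-networks, not the actual architecture of [2,3]:

```python
import numpy as np

rng = np.random.default_rng(1)

# Conditional coupling block: the scale/shift sub-networks see [x1, c],
# so the mapping stays invertible in x for any fixed condition c.
Dx, Dc = 4, 3
W_s = 0.1 * rng.standard_normal((Dx // 2 + Dc, Dx // 2))
W_t = 0.1 * rng.standard_normal((Dx // 2 + Dc, Dx // 2))

def cond_forward(x, c):
    x1, x2 = x[: Dx // 2], x[Dx // 2 :]
    h = np.concatenate([x1, c])        # condition enters the sub-networks only
    s, t = h @ W_s, h @ W_t
    return np.concatenate([x1, x2 * np.exp(s) + t])

def cond_inverse(z, c):
    z1, z2 = z[: Dx // 2], z[Dx // 2 :]
    h = np.concatenate([z1, c])        # same condition, same sub-networks
    s, t = h @ W_s, h @ W_t
    return np.concatenate([z1, (z2 - t) * np.exp(-s)])

x, c = rng.standard_normal(Dx), rng.standard_normal(Dc)
assert np.allclose(cond_inverse(cond_forward(x, c), c), x)
```

Since c never needs to be inverted, it can also be the output of an arbitrary feature extraction network, which is what makes the architecture applicable to images and other high-dimensional observations.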

The posterior distribution of physical parameters, p(x|y), is estimated on the basis of the inverse mapping f⁻¹ = g. During training, we prescribe the latent variables to have a standard normal probability distribution p(z) = N(z; 0, I) with zero mean and unit standard deviation, where I is the identity matrix of dimension dim(z) × dim(z). Following the inverse process x = g(z; c), the posterior distribution is a transformation of the known distribution p(z) to x-space, conditioned on the observation.
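In practice, posterior estimation with a trained cINN amounts to repeatedly sampling z from N(0, I) and pushing each sample through g conditioned on the fixed observation. The sketch below uses a trivial stand-in for the trained inverse mapping g, purely to show the sampling loop; in a real application g would be the trained network.

```python
import numpy as np

rng = np.random.default_rng(2)

# Posterior sampling: for a fixed observation y_obs, draw many latent
# samples z ~ N(0, I) and push each through the inverse mapping g(z; c).
# The resulting x values are samples from the approximate posterior p(x | y_obs).
dim_z = 4

def g(z, c):
    # Placeholder for the trained inverse network; here just a toy
    # condition-dependent shift so the script runs end to end.
    return z + c.sum()

y_obs = np.array([0.3, -1.2, 0.7])
z_samples = rng.standard_normal((1024, dim_z))
posterior_samples = np.array([g(z, y_obs) for z in z_samples])
print(posterior_samples.shape)  # (1024, 4)
```

Histograms or kernel density estimates of these samples then yield the marginal posterior for each physical parameter, which is how the posterior distributions discussed in the examples below are obtained.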

Example 1: INN for Stellar Parameters
In the first example we employ the cINN approach to the task of predicting physical parameters of individual stars based on photometric observations of spatially resolved clusters [31].
In this pilot study we train and test the neural network on synthetic data from the PARSEC stellar evolutionary models [9] and perform a benchmark analysis on real observational data obtained with the Hubble Space Telescope for the young cluster Westerlund 2 [40] and the old globular cluster NGC 6397 [36]. These clusters are chosen to cover the extremes of the cluster age range, i.e. very young and very old, in order to gain first insights into the systematics of our approach. We construct the synthetic training sets by adopting isochrone model tables of the correct metallicity for Westerlund 2 and NGC 6397, respectively. The prediction of stellar mass, luminosity, effective temperature and surface gravity works extraordinarily well, with posterior distributions that are narrowly constrained around the true values. Determining the stellar age is a more difficult task, as illustrated in Figure 2.
The predicted posteriors tend to be broader and often exhibit multi-modalities, revealing ample degeneracies in the age prediction. While we can confirm that the true value is part of the predicted distribution in more than 99% of the cases, there are several instances where it does not coincide with the most likely outcome of the posterior, falling into a second peak instead.
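A MAP estimate like the one used to color-code Figure 2 can be obtained from posterior samples by locating the mode of the sample distribution. A simple histogram-based version on toy bimodal data (mimicking a degenerate age posterior; the numbers are illustrative, not from the study) might look as follows.

```python
import numpy as np

# MAP estimate from posterior samples: histogram the samples and take the
# centre of the fullest bin. The toy data is bimodal, like the degenerate
# log(age) posteriors described in the text (values are illustrative only).
rng = np.random.default_rng(3)
samples = np.concatenate([rng.normal(6.5, 0.1, 700),   # dominant age mode
                          rng.normal(7.3, 0.1, 300)])  # secondary mode

counts, edges = np.histogram(samples, bins=50)
peak = np.argmax(counts)
map_estimate = 0.5 * (edges[peak] + edges[peak + 1])
print(map_estimate)  # close to 6.5, the dominant mode
```

When the secondary mode is nearly as strong as the primary one, the MAP estimate can jump between peaks, which is exactly the failure mode seen for the turn-on stars in Figure 2.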

Example 2: INN for HII Region Diagnostics
Another application of the INN architecture is the study of the physical properties of extragalactic star clusters and star-forming clouds from individual emission lines of HII regions. We present a cINN that predicts the posterior distribution of seven physical parameters (cloud mass, star formation efficiency, cloud density, cloud age as in the age of the first generation of stars, age of the youngest cluster, the number of clusters, and the evolutionary phase of the cloud) from the luminosities of 12 optical emission lines, and test our network with synthetic models that are not used during training. The training database is constructed using the WARPFIELD Emission Predictor (WARPFIELD-EMP) [35], which allows us to collect both cloud properties and corresponding observable quantities (i.e. line luminosities). WARPFIELD-EMP describes the evolution of a cluster, expanding bubble, and the surrounding cloud using the 1D stellar feedback code WARPFIELD [38] and calculates detailed emission predictions based on the output from WARPFIELD with the help of CLOUDY [18] and the radiative transfer code POLARIS [39]. WARPFIELD takes into account several feedback mechanisms (i.e., stellar winds, radiation pressure, thermal gas pressure, supernovae, and gravity) self-consistently.


Summary
The proof-of-concept studies mentioned here have successfully demonstrated the large potential and versatility of invertible neural networks for astronomical and astrophysical applications. Despite addressing a wide range of different scales and physical systems, these applications have in common that the ground-truth sample used to train the neural network is based on synthetic data, generated either from large-scale numerical simulations, Markov-chain Monte Carlo (MCMC) methods, or a database of physical models to generate the physical feature space x, combined with advanced post-processing and radiative transfer methods to produce the corresponding space of synthetic observables y. Training on synthetic data is needed in many astrophysical applications, because it is often not possible to build a ground-truth sample based on observational data alone, and even in those cases for which sufficiently well understood observations exist, their numbers are usually too low to adequately train a neural network. Furthermore, training on synthetic data gives us a high degree of control over the problem and allows us to better understand the flow of information through the network, so that we can validate the network performance with high precision and accuracy. We can address the important question of measurement errors and internal degeneracies in the astrophysical system, i.e. in the mapping from x to y, and we can quantitatively assess how they influence the posterior distribution function. Once the INN is fully tested and characterized, its application to real astronomical observations then also allows us to assess the fidelity and accuracy of the underlying physical model that was used to train the network.

Figure 1. Schematic overview of the cINN architecture with physical input parameters x and observational features y as a condition c. The latent variables z capture the information loss during the (forward) training phase. The particular example depicted here consists of eight affine coupling blocks interleaved with permutation layers. It was developed as a proof of concept for the analysis of diagnostic emission lines from young star clusters discussed by Kang and colleagues [25]. The zoom-in panel of the conditional affine coupling block shows how the information is passed through the block in the forward direction. Further details are provided by Ksoll and collaborators [31].

Figure 2. Illustration of the ability of the cINN to capture degeneracies in the physical model and properly cope with multi-modal posterior distribution functions, adapted from Ksoll and collaborators [31]. The middle panel shows a zoom-in of the optical color-magnitude diagram (CMD) of the young massive star cluster Westerlund 2 [40], with its stars color-coded according to the maximum a posteriori (MAP) estimates of log(age). The four smaller panels show the predicted age posterior distributions of highlighted stars. Note that stellar age is one of the most difficult stellar parameters to predict from photometric and spectroscopic observations. The bottom left panel is an example pre-main-sequence star for which our approach provides excellent results, returning a very narrow age distribution at the proposed cluster age. The remaining three cases are taken from stars likely on the turn-on of the main sequence, for which the MAP age estimate is significantly above the suggested age of Westerlund 2.

Figure 3. Two-dimensional histogram showing the ratios of diagnostic emission lines of singly ionized nitrogen, [NII], doubly ionized oxygen, [OIII], and atomic hydrogen, Hα and Hβ, for all of the models in the test set, where brighter color indicates a higher number of models. Overlaid as yellow stars are the corresponding values of three representative cases. Zoom-in panels show the distribution of the line ratios that we recover if we sample the posterior distribution for each example model 1024 times and use the resulting values as input for new WARPFIELD-EMP calculations. The true line ratio values for each example model are represented by red lines in these zoom-in panels. Green circles in each zoom-in panel indicate the area within which 68% of the models fall, measured from the centre of the distribution. The image is adapted from Kang et al. [25].