GPU ACCELERATION OF DOPPLER BROADENING FOR NEUTRON TRANSPORT CALCULATIONS1

In radiation transport calculations, the effects of material temperature on neutron/nucleus interactions must be taken into account through Doppler broadening adjustments to the microscopic cross section data. Historically, Monte Carlo transport simulations have accounted for this temperature dependence by interpolating among precalculated Doppler broadened cross sections at a variety of temperatures. More recently, there has been much interest in on-the-fly Doppler broadening methods, where reference data is broadened ondemand during particle transport to any temperature. Unfortunately, Doppler broadening operations are expensive on traditional central processing unit (CPU) architectures, making on-the-fly Doppler broadening unaffordable without approximations or complex data preprocessing. This work considers the use of graphics processing unit (GPU)s, which excel at parallel data processing, for on-the-fly Doppler broadening in continuous-energy Monte Carlo simulations. Two methods are considered for the broadening operations – a GPU implementation of the standard SIGMA1 algorithm and a novel vectorized algorithm that leverages the convolution properties of the broadening operation in an attempt to expose additional parallelism. Numerical results demonstrate that similar cross section lookup throughput is obtained for on-the-fly broadening on a GPU as cross section lookup throughput with precomputed data on a CPU, implying that offloading Doppler broadening operations to a GPU may enable on-the-fly temperature treatment of cross sections without a noticeable reduction in cross section processing performance in Monte Carlo transport codes.


INTRODUCTION
Neutron transport algorithms require accurate representation of the fundamental underlying cross section data. An important aspect of this is accurate modeling of the impact of material temperature on cross sections due to the thermal motion of the target nuclei during particle transport. This effect is modeled via Doppler broadening, which has the effect of "broadening" sharp cross section resonances into smoother, wider peaks as material temperature increases. Traditionally, effects of material temperature have been handled in neutron transport calculations by pre-calculating broadened cross sections at various temperatures representative of the problem being solved and interpolating the cross section on this temperature mesh. These broadened cross sections are most frequently calculated using the SIGMA1 method [1][2]. While the broadening itself is exact, the interpolation between temperature points carries an amount of error, depending on the fineness of this mesh [3][4]. Additionally, storing tabulated cross sections at many temperatures results in increasingly large memory requirements.
An alternative to pre-computed cross sections is on-the-fly Doppler broadening. With on-the-fly broadening, cross section data is only stored at the coldest temperature in the problem, and the broadening operation is performed at the time of cross section lookup. With this on-the-fly scheme, the cross section data is exact to the degree of the specific broadening operation used, and only one set of cross sections needs to be stored (at the coldest temperature). However, the operations involved in broadening are typically expensive. Other methods have been discussed for on-the-fly broadening which are less computationally expensive, including the windowed multipole method [5] and a regression-based on-the-fly method [6].
In this paper, we explore the possibility of using general-purpose graphics processing units (GPGPUs or just GPUs) to perform on-the-fly Doppler broadening computations. GPUs excel at problems that can be expressed as a set of massively many smaller tasks. This reflects the underlying architecture in the device itself. Whereas traditional central processing units (CPUs) have the hardware capability to operate with up to tens of threads at a time, GPUs have hardware to operate with tens of thousands of threads at a time. It is thus of interest to investigate whether the operations involved with Doppler broadening can be expressed in this way. This paper presents two Doppler broadening algorithms that were implemented on a GPU. The first is an implementation of the standard SIGMA1 method [1] optimized for GPU computation. The second is a novel method that relies on the fact that the Doppler broadening kernel used in SIGMA1 computation is fundamentally a convolution and leverages an existing fast convolution method in an attempt to expose additional parallelism in the operation, thus making it more suitable for computation on a GPU.

THE SIGMA1 DOPPLER BROADENING ALGORITHM
We begin by briefly reviewing the SIGMA1 algorithm. The analytical form of the Doppler broadening equations for a cross section at broadened temperature from [1] is where the variable transformations have been applied. The standard method implemented in this study is the SIGMA1 method [1] as implemented in NJOY [2]. This method operates on linearly-interpolable tabulated cross sections at reference temperature (4) With this notation, the Eq. (1) can be expressed as (from [2]) * = 456 . 7 . − / . . 8 + 9 . / . : where the index variable i spans all cross section data points in the energy range [y -4, y + 4] and the 6 . , 9 . terms can be expressed as and = C,. are given as A recent demonstration mini-app [7] proposed a slight re-ordering of operations to enable compiler autovectorization, which effectively groups together calculations of the D C functions: * where 6 . * , 9 . * are expressed as This re-ordered algorithm is used for the implementation described in this paper.

NOVEL VECTORIZED METHOD OVERVIEW
The SIGMA1 method (and its re-ordering) is parallelizable to the degree that there is an independent set of operations for each reference cross section data point in the relevant data range. By leveraging a fast convolution method [8], an additional level of parallelism can be exposed, breaking each set into a smaller collections of independent operations. First, we observe (like some other modern methods, as in [9]) that the Doppler broadening equation (Eq. (1)) can be cast as a convolution operation, * , = 1 When expressed in this form, it is possible to leverage certain properties of convolution integrals. For instance, Reference [8] describes how the convolution of two piecewise polynomial functions can be quickly computed via a conversion to "delta function integral representation" -an alternative way of representing such functions. This technique can be used if both Σ ; , and M can be cast as piecewise polynomials. Cross section data are already given in piecewise linear form, which becomes piecewise quadratic under the transform given by Eq. (2): Reference [8] describes how piecewise polynomials like this can be converted to an equivalent delta function representation. This single representation is valid over the entire domain of the function, and aids in the computation of the convolution. Performing this conversion yields the representation of Σ ; , : where, from [8], the triplet notation is defined Recalling the definitions of Q U and R U from Eq. (11) it is readily apparent that Also, note that when 0 $ , kl 1 = 0, From this, Eq. (12) reduces to In order to apply the method given in [8], M needs also to be expressed as a piecewise polynomial. A piecewise step decomposition that preserves the integral between any two mesh points was used: With Eqs. (16) and (18), the convolution can be computed analytically by using the identity The resultant convolution of Σ0 ; x , kl 1 and M is then The domain of this convolution is (0,∞]. However, in practice, it is preferable to truncate the convolution evaluation to a smaller region of interest (typically from y -4 to y + 4). Outside of this truncation range, the cross section is set to zero. With this, the vectorized Doppler broadening method proceeds as follows: Pre-calculation -can be performed at time of cross section data import 1. Convert (energy, cross section) data point pairs to the Ψ U form in Eqs. (11) and (20) and store. These data points are independent of temperature and can be used in a broadening operation to any target temperature greater than the reference temperature. 2. Select and store a set of P mesh points and construct and store the Γ . data as described in Eqs. (17) and (20). In order to match up with the (y -4 to y + 4) importance region from which the cross section is evaluated, mesh points only need span the region [-4, 4].

GPU IMPLEMENTATION DETAILS
This study implemented both the SIGMA1 method and novel vectorized methods as CUDA kernels. All cross section data was read and stored in global memory on the GPU. The lookup parameters (i.e., lookup energy, temperature, etc.) are assembled into buffers on the CPU, then sent to the GPU at the time of kernel dispatch. In order to make the overhead associated with data transfer and kernel launch small with respect to total runtime, lookups were batched and dispatched all at the same time. In order to hide data transfer latency, these batches were further broken up into inner batches, to allow data transfer to happen concurrently with computation.

SIGMA1 Implementation
The CUDA implementation of the SIGMA1 broadening operation leverages the fact that each iteration of the summation in Eq. (8) can be computed in parallel and that the local results can be reduced at the end. However, typical CPU implementations of SIGMA1 use a method to start in the middle of the data subset (the subset of the cross section data that spans [y -4, y + 4]) and traverse to the edges of the set, ensuring that the method uses only the data that is necessary. This kind of traversal poses a problem for a parallelized implementation, where a priori knowledge of the bounds of our dataset before computation is desired. Furthermore, searches for specific energy points in a cross section evaluation are comparatively slow relative to the speed of the broadening operation. Therefore, the SIGMA1 implementation searches for an approximate data subset, which is guaranteed to at least contain the required cross section data, using a lethargy mapping function.

Novel Vectorized Implementation
The CUDA implementation of the novel vectorized operation adds another level of parallelism to the broadening operation-parallelism in P-space (i.e., along the various P mesh points of the M decomposition). This allows for more threads to work on the same lookup. Similar logic as discussed in §4.1 applies to the searches for the bounds of the data subset.

TIMING RESULTS
To gauge the performance of utilizing GPUs for Doppler broadening, an exploratory mini-app was created with both CPU and GPU implementations of the traditional SIGMA1 and the new vectorized Doppler broadening algorithms. The objective of the mini-app was to assess cross section lookup throughput, in lookups per second, or LUP/s, as a function of implementation and lookup batch size. Comparisons of the two GPU implementations with respect to cross section lookups without Doppler broadening on a CPU, which is the standard practice in most Monte Carlo codes, as well as CPU SIGMA1 on-the-fly Doppler broadening, which represents the "worst possible case" of a performance hit when adding on-the-fly broadening to a transport simulation were of particular interest.
All implementations were run through the mini-app on a compute node with an 18-core (36-thread) Intel Xeon Gold 6140 Processor and 2 NVIDIA Tesla v100 GPUs (only 1 of which was used). This exercise compared raw throughput (in LUP/s) as a function of batch size, in lookups. It was expected that overall throughput is expected to increase as the number of lookups dispatched at once increases, due to the presence of fixed overhead costs. To investigate this behavior, lookup throughput was measured for various batch sizes for lookups in the resolved resonance range of the ENDF/B-VII.0 evaluation of U-238. These results are shown in Figure 1. Also plotted in Figure 1 is the CPU SIGMA1 throughput as well as the reference lookup throughput, which is the performance of cross section lookups without on-the-fly Doppler broadening.

Figure 1. Throughput (in LUP/s) of cross section lookups for U-238 as a function of batch size for the various cross section lookup methodologies implemented.
As seen in Figure 1, throughput increases for both the GPU SIGMA1 and vectorized method implementations as batch size increases, as expected, up to a batch size of around 10 4 . As batch size approaches 10 7 , both GPU on-the-fly Doppler broadening methods have throughput that matches or exceeds the reference lookup. This suggests that similar computational performance can be achieved with on-thefly Doppler broadening on GPUs as can be achieved through CPU cross section lookups without on-the-fly Doppler broadening. Figure 1 also indicates that the GPU SIGMA1 implementation outperforms the vectorized method. This is a counterintuitive result since the vectorized method is prima facie "more parallel" than SIGMA1. However, relative gains in parallelism with the vectorized method appear to be offset by the larger computational costs imposed by the method. In addition, some latencies in dispatching SIGMA1 broadening were hidden through batching GPU kernel dispatches, which can potentially expose further parallelism in SIGMA1 broadening.

SUMMARY AND CONCLUSIONS
This paper presents GPU implementations for the SIGMA1 Doppler broadening method as well as a novel vectorized algorithm that leverages special properties of the convolution integral. An exploratory mini-app was used to compare the computational performance of these implementations against CPU cross section lookups without Doppler broadening and a CPU SIGMA1 implementation. Numerical results from the mini-app demonstrate that GPU implementations of Doppler broadening outperform CPU implementations if cross section lookups are batched together prior to GPU dispatch. In fact, for large dispatch batch sizes, the Doppler broadening throughput on a GPU is similar to the CPU throughput for cross section lookups without on-the-fly Doppler broadening. A direct comparison of results for the two GPU broadening methods reveals that the SIGMA1 method consistently outperforms the newly-developed vectorized algorithm. Although the SIGMA1 algorithm is theoretically "less parallel" than the vectorized algorithm, and hypothetically less suitable for implementation on a GPU, this algorithmic parallel inefficiency can be effectively overcome by dispatching many SIGMA1 operations concurrently on the GPU. The results of this study demonstrate that Doppler broadening algorithms are well-suited for implementation on GPUs. Furthermore, preliminary results generated with a representative Monte Carlo mini-app suggest that offloading broadening operations to a GPU may enable on-the-fly Doppler broadening during Monte Carlo transport simulations without a significant decrease in performance relative to static temperature interpolation methods used in many contemporary codes.