Single flavour filtering for RHMC in BQCD

Filtering algorithms for two degenerate quark flavours have advanced to the point that, in 2+1 flavour simulations, the cost of the strange quark is significant compared with the light quarks. This makes efficient filtering algorithms for single flavour actions highly desirable, in particular when considering 1+1+1 flavour simulations for QED+QCD. Here we discuss methods for filtering the RHMC algorithm that are implemented within BQCD, an open-source Fortran program for Hybrid Monte Carlo simulations.


Introduction
While the lattice QCD community has achieved dynamical simulations at or near the physical pion mass [1][2][3][4], the Monte Carlo generation of gauge field configurations with light quarks presents a numerically intensive challenge, and is only possible for groups with access to primary tier supercomputing resources. After several decades, the Hybrid Monte Carlo (HMC) algorithm [5] remains the method of choice for dynamical QCD simulations.
The leading expense in modern dynamical simulations is due to the fermion determinant. To enable the use of Monte Carlo methods, the Grassmannian fields which represent the quarks are integrated out analytically, where M f is the fermion matrix for the quark flavour f. At this point the resulting fermion matrix determinant can be estimated stochastically through the introduction of pseudofermions, The pseudofermion fields φ are standard complex numbers and are therefore computable. The simplest pseudofermion formulation is for two degenerate quark flavours, where we have taken advantage of the fact that the determinant of the fermion matrix M is real. In this instance, the determinant is represented by a pseudofermion action with a positive semi-definite matrix, where K = M † M. The positive semi-definite property enables the generation of pseudofermion fields that satisfy the probability distribution ρ(φ) ∼ e −S 2 f (φ) . Single flavour fermion simulations pose additional computational challenges. For Wilson-type actions, the fermion matrix has a complex spectrum, and hence cannot be used directly in the pseudofermion distribution without additional modification. Noting that, if det M is real and positive we have The Rational HMC (RHMC) algorithm [6,7] solves this problem by replacing the fermion matrix M with a rational approximation to the inverse square root, where K = M † M is positive semi-definite, such that the pseudofermion action for a single flavour is Typically, R(K) is chosen to be the Zolotarev approximation, which is the optimal rational polynomial approximation for R(z) z − 1 2 . Due to isospin symmetry being very nearly exact, in pure QCD simulations the up and down quark are almost always chosen to be degenerate. This means that in a 2 + 1 flavour simulation only the strange must be treated using RHMC. Historically, the cost of adding the strange quark to a dynamical simulation with already light up and down quarks was relatively small. However, there have been a large number of algorithmic improvements focused on reducing the cost of two flavour simulations, such that in a modern dynamical simulation the cost of the including the strange quark is significant.
Moreover, the precision of lattice QCD calculations of certain quantities has advanced to the point that electromagnetic effects can no longer be neglected, see e.g. [8][9][10][11][12]. In these QCD+QED calculations, isospin symmetry is explicitly broken as the up and down quarks have different charges q u q d . This means that at the physical point, 1+1+1 flavour simulations are needed [12], noting that the down and strange quark have different masses m d m s . This motivates us to seek improvements to the single flavour RHMC algorithm that parallel the advances made for the degenerate two flavour case.

Filtering algorithms
The essential principle underlying the Hybrid Monte Carlo algorithm is the introduction of a fictitious simulation time co-ordinate, along with the corresponding conjugate momenta and a Hamiltonian describing the evolution in this new simulation time. The molecular dynamics evolution approximately preserves the extended Hamiltonian and this ensures a high acceptance rate for the global Metropolis accept/reject step that takes place at the end of each trajectory.
There are two sources of computational expense in performing simulations at light quark mass. Both stem from the fact that the fermion determinant must be represented stochastically using pseudofermions with a non-local action that features the inverse of the fermion matrix. The first is that the condition number of the fermion matrix increases at light quark mass, causing a large increase in the number of conjugate gradient iterations required to solve the linear system. Solving this system is required at each molecular dynamics step, which means the numerical expense of the inversions is the dominant cost for generating dynamical gauge fields. The second is that the molecular dynamics step-size must be decreased as the quark mass is reduced in order to maintain control over the rapid fluctuations driven by the pseudofermion field.
Splitting the pseudofermion action using a filter provides the opportunity to address both these issues. The application of a filter F to the fermion action kernel L results in a pseudofermion action with multiple terms, A good choice of filter can have a two fold effect. F −1 should act as a preconditioner for L, such that the cost of evaluating the molecular dynamics force is reduced. The second benefit comes from the ability to place the evolution of the two terms on different time scales, such that the expensive fermion matrix inversions are performed less often per trajectory.
There are a number of filtering techniques [13][14][15][16][17][18] that have been developed for two flavour simulations, two of which are most relevant to this discussion. Polynomial filtering [13] uses a polynomial approximation to the inverse as a preconditioner for L, with the advantage that a short polynomial filter can simultaneously capture the high frequency dynamics whilst being cheap to evaluate. Utilising appropriately factored polynomials allows for additional time scales to be introduced. Mass preconditioning [14,15] uses the lattice Dirac matrix at heavier value of the quark mass parameter as a filter, and again it is possible to create a tiered molecular dynamics integration scheme by using successively increasing values of the hopping parameter κ, with simulations at the physical quark mass using four or more Hasenbuch terms. The possibility of combining mass preconditioning and polynomial filtering was recently shown to remove the need to fine tune the Hasenbuch parameter(s) [19].
In this work we propose two classes of filter that can be applied to the Rational Hybrid Monte Carlo action, Analogously to the two flavour case, we can construct a polynomial filter for RHMC by choosing a low order polynomial P(K) K −1/2 that approximates the inverse square root, such that Given a class of polynomials P(K), such as the Chebyshev approximations, we only need to tune the polynomial order p. The polynomial term P(K) is cheap to evaluate. So long as the roots of P(K) are sufficiently away from zero then we can obtain the filtered rational polynomial P(K) −1 R(K) needed for the remainder term with minimal additional expense by using a multi-shift solver, and the filter will reduce the force associated with the remainder term such that it can be placed on a coarser integration scale. The second type of filter we propose is based on the observation that the class of Zolotarev rational polynomials R(K) K −1/2 can be written down in an ordered product form, where a k , b k , d n > 0, and the coefficients in the numerator and denominator strictly decrease as k increases such that a k > a k+1 , b k > b k+1 . Table 1 provides the coefficients for the n = 20 Zolotarev approximation R(z 2 ) optimised for the range z ∈ [5 × 10 −5 , 3] as an example. The ordered product can where the overall coefficient d n is only present if i = 0. The value of this partitioning is revealed when we observe that truncations of the ordered product R 0,t (K) at order t < n also provide an approximation to the inverse square root, This property is demonstrated in Figure 1, where we plot the functions zR 0,t (z 2 ) 1 for z > 0, where R 0,t are the truncations of the n = 20 Zolotarev approximation specified in Table 1. We see that for all choices of t the approximation is good at large values of z, while the range over which the approximation is valid gets successively better as the truncation order t increases. We observe at t = 16 the approximation error is negligible over almost the entire range of z. Hence we have shown that truncations of the ordered product successfully capture the high frequency dynamics. Next we consider the correction term which is simply the remaining factors associated with a given truncation order t, Note that, due to the ordering of the product, for values of k >> 1 we have that a k b k are both small, and as the order of the polynomials in the numerator and denominator are equal, the correction term is an approximation to unity. This is very clearly demonstrated in Figure 2, which shows the correction terms R t,n (z 2 ) arising from different choices of the truncation order t for the Zolotarev approximation from Table 1. The bounds on the vertical axis are set at a 10% deviation from 1. Even at t = 8, the correction term is approximately one over a large range, while at t = 16 the remainder is indistinguishable from unity at the displayed scale. This is significant, as it implies that the correction term will have a correspondingly small molecular dynamics force and thus can be placed on a coarse integration scale. Based on the above observations, we define the truncated ordered product (top) filtering scheme for RHMC, denoted by tRHMC, such that is the tRHMC pseudofermion action. Here we point out a significant advantage of the tRHMC scheme is that, as both terms are simply rational polynomials, it is very easy to implement using preexisting RHMC code. It is also worth noting that partitioning the ordered product form of the rational polynomial is distinct from partitioning the sum over poles, in which different poles in the expansion of R(K) are simply placed on different time scales. The reason is that the sum over poles simply distributes the inversions based on their expense, and does not share the beneficial properties of the tRHMC scheme mentioned above, namely that by also approximating K − 1 2 , R 0,t acts as a high pass filter. A further computational efficiency for the tRHMC scheme may be realised on some hardware architectures due to the memory bandwidth benefits of evaluating multiple small order rational polynomial terms, rather than a single term of full order. t=16 t=12 t=8 t=4 Figure 2. The correction term R t,n (z 2 ) 1 for various values of the truncation order t. The remainders R t,n derive from a Zolotarev rational polynomial R(z 2 ) 1/ √ z 2 of order n = 20 over the range z ∈ [5 × 10 −5 , 3]. The graph demonstrates that truncating the ordered product yields remainders that approximate unity.
As has been done in the two flavour case, hierarchical filtering schemes can be defined for the single flavour filters that we have proposed. In the case of PF-RHMC, we can add a higher order polynomial Q(K), (q > p) which approximates P(K) −1 K −1/2 as a secondary filter, which results in the 2PF-RHMC action, For the tRHMC scheme, we can add a second truncation point in the order product, such that the 2tRHMC action is given by

Characteristic scale tuning
Using a fully generalised integration scheme [13,19], it is possible to place each term in the molecular dynamics action on an independent scale. Typically, the step sizes h i associated with a given action term S i can be tuned by force balancing, where given some measure F i of the molecular dynamics forces, for each term we attempt to balance the product of the step size and the force such that F i h i ∼ constant. Here, F i can be the average or maximum force associated with a given term. The maximum tends to prove a better choice, as it is more closely linked with the force variance. However, in the case of tRHMC, for higher order truncations, the force due to the remainder term is very small. This makes the force-balanced approach to tuning the step sizes unfavourable. Noting that for filtered actions the acceptance rate is primarily determined by the coarsest scale, we adopt an alternative approach to tuning using the characteristic scale. The coarsest scale h 0 is tuned by 2tRHMC Figure 3. The cost function C = N mat /ρ acc for the different algorithms used in this study. The red point is for vanilla RHMC, while the blue points are for the filtering schemes using characteristic scale tuning. From left to right, first we have the single filter results at various values of their respective order parameters, namely PF-RHMC for different polynomial order p, and tRHMC for different truncation order t. Next we have the two filter results, with 2PF-RHMC at fixed p = 4 and various q for the polynomial orders, followed by 2tRHMC at fixed t 1 = 4 and various t 2 for the truncation orders.
fitting the acceptance probability with the complementary error function, ρ acc ≈ erfc(h 2 0 c 2 ), where the characteristic scale c is the only fit parameter. With the coarsest scale h 0 excluded, the remaining scales h 1 > h 2 > h 3 . . . are set using the force balancing relation to achieve the target acceptance rate.

Results
All runs were performed using the open source BQCD software package [22,23]. Both the PF-RHMC and tRHMC filtering methods are built into the latest release, which is available at the following link: www.rrz.uni-hamburg.de/services/hpc/bqcd To compare the different filtering algorithms proposed herein, we start with a thermalised 16 3 × 32 lattice with N f = 2 flavours of Wilson fermions from a previous study [19]. The Wilson gauge action is used with β = 5.6, with hopping parameter κ = 0.15825, providing a pion mass of m π ∼ 400 MeV and a lattice spacing of a ∼ 0.08 fm.
The various RHMC simulations are then performed with (degenerate) 1 + 1 single-flavour pseudofermions. As both flavours are treated equally, they have degenerate force and fermion matrix multiplication count distributions. We tune the step-sizes using the characteristic scale method above such that the acceptance rate ρ acc ∼ 0.75. The rational polynomial used is the Zolotarev approximation with order n = 20 and range [5 × 10 −5 , 3] given in Table 1. The polynomial filters used in the PF-RHMC runs are the Chebyshev approximations to K −1/2 (or P(K) −1 K −1/2 in the two filter case) with range [5 × 10 −5 , 3]. The cost function we use to compare our methods is the same as in [19], namely where the matrix multiplication count N mat is the total for both flavours. Figure 3 compares the cost function for the various filtering methods described herein. Polynomial-filtered RHMC provides a small benefit over vanilla RHMC. It is clear that tRHMC provides the best performance improvement, at a speedup of around 1.75. The optimal point for tRHMC and 2tRHMC is similar, but the two truncation form is less sensitive to the choice of the truncation parameter.
In summary, we have seen that the tRHMC filtering scheme coupled with characteristic scale tuning works well cross a wide range of truncation orders. The tRHMC algorithm is provided by BQCD, but is also very easy to implement in existing RHMC codes. Multiple truncation filters can be used, and this should prove to be useful at light quark masses. The full details of our study will be published elsewhere.