Multigrid accelerated simulations for Twisted Mass fermions

Simulations at physical quark masses are affected by the critical slowing down of the solvers. Multigrid preconditioning has proved to deal effectively with this problem. Multigrid accelerated simulations at the physical value of the pion mass are being performed to generate N_f = 2 and N_f = 2 + 1 + 1 gauge ensembles using twisted mass fermions. The adaptive aggregation-based domain decomposition multigrid solver, referred to as the DD-αAMG method, is employed for these simulations. Our simulation strategy consists of a hybrid approach combining different solvers, namely the Conjugate Gradient (CG), the multi-mass-shift CG and the DD-αAMG solver. We present an analysis of the multigrid performance during the simulations, discussing the stability of the method. This significantly speeds up the Hybrid Monte Carlo simulation by more than a factor of 4 at the physical pion mass compared to the usage of the CG solver.


Introduction
Simulations at the physical value of the pion mass have been intensively pursued by a number of lattice QCD collaborations. In order to accomplish these simulations, speeding up the solvers is an essential step. A successful approach is based on multigrid methods used as preconditioners for standard Krylov solvers. There are a number of variant formulations of highly optimized multigrid solvers, which yield improvements of more than an order of magnitude for operators at the physical value of the light quark mass, as reported in Refs. [1-4].
In this work, we focus on simulations with twisted mass (TM) fermions. This discretization scheme has the advantage that all observables are automatically O(a) improved when tuned to maximal twist [5]. This formulation is thus particularly suitable for hadron structure studies, since the probe, such as the axial current, needs no further improvement, in contrast to clover-improved Wilson fermions. Furthermore, the presence of a finite TM term bounds the spectrum of DD^† from below by a positive quantity µ^2, where D is the Wilson Dirac operator and µ is the TM parameter. This avoids exceptional configurations and, at the same time, gives an upper bound on the condition number, guaranteeing the convergence of numerical methods. Using this discretization approach enables us to study a wide range of observables. Both the simulations themselves and the calculation of observables have substantially benefited from employing multigrid methods.

Table 1. The lattice volume V, the lattice spacing a, the pion mass m_π and the number of molecular dynamics units (MDU) of the two ensembles discussed in this work.

                      V             a [fm]       m_π [MeV]   MDU
    N_f = 2           …             …(5)         135.2(7)    3128
    N_f = 2 + 1 + 1   64^3 × 128    0.0820(4)    136.1(7)    3051
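The lower bound quoted above follows in one line from the Γ_5-hermiticity of the Wilson Dirac operator, D^† = Γ_5 D Γ_5. For the degenerate TM operator D(µ) = D + iµΓ_5 one has

    D(µ)^† D(µ) = (D^† - iµΓ_5)(D + iµΓ_5) = D^†D + iµ (D^†Γ_5 - Γ_5 D) + µ^2 = D^†D + µ^2 ≥ µ^2 ,

since Γ_5-hermiticity gives D^†Γ_5 = Γ_5 D, so the cross terms cancel; the same argument applied to D(µ)D(µ)^† yields DD^† + µ^2, which is the bound stated above.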
Here we show results for two simulations at maximal twist and at the physical value of the pion mass, which have been generated over the last two years using twisted mass fermions. The properties of these ensembles are listed in Table 1. Results of the N_f = 2 simulation have been partially presented in Refs. [3,6], where a statistics of 2000 molecular dynamics units (MDU) was used. The tuning procedure of the N_f = 2 + 1 + 1 ensemble and some physical results are presented in Ref. [7].
Here we discuss in detail the HMC simulations, with a focus on the usage of multigrid methods and in particular the performance of the DD-αAMG solver adapted for TM fermions [3]. We discuss our strategy for the calculation of the force terms and for the generation of the multigrid subspace during the molecular dynamics (MD) integration, and demonstrate that this yields stable simulations with improved performance.

DD-αAMG method
The adaptive aggregation-based domain decomposition multigrid (DD-αAMG) method has been introduced in Ref. [2] as a solver for the clover-improved Wilson Dirac operator D. In the DD-αAMG method a flexible iterative Krylov solver is preconditioned at every iteration step by a multigrid approach, given by the error propagation

    ε ↦ (1 - M D)^k (1 - P D_c^-1 P^† D) ε ,    (1)

where M is the smoother, k is the number of smoothing iterations, P is the prolongation operator and D_c = P^† D P is the coarse Wilson operator. The multigrid preconditioner exploits domain decomposition strategies, using for instance the Schwarz Alternating Procedure (SAP) [8] as the smoother and an aggregation-based coarse grid operator for the coarse grid correction. The method is designed to deal efficiently with both the infrared (IR) and the ultraviolet (UV) modes of D. Indeed, the smoother reduces the error components belonging to the UV modes [2], while the coarse grid correction deals with the IR modes. This is achieved by using an interpolation operator P which approximately spans the eigenspace of the small eigenvalues. Thanks to the property of local coherence [9], this subspace can be approximated by aggregating over a small set of N_v = O(20) test vectors v_i, which are computed via an adaptive setup phase [2]. We remark that the prolongation operator in the DD-αAMG method is Γ_5-compatible, i.e. Γ_5 P = P Γ_{5,c}. Thanks to this property the Γ_5-hermiticity of D is preserved on the coarse grid as well, i.e. D_c^† = Γ_{5,c} D_c Γ_{5,c}.

The DD-αAMG approach has been adapted in Ref. [3] to the Wilson TM operator D(±µ) = D ± iµΓ_5. Due to the Γ_5-compatibility, the coarse operator takes the same form as the fine operator, i.e. D_c(±µ) = D_c ± iµΓ_{5,c}, and the same prolongation operator P can be used for both signs of µ. In N_f = 2 + 1 + 1 simulations we use a three-level DD-αAMG solver also for the non-degenerate TM operator

    D_ND(µ̄, ε̄) = D ⊗ I_2 + i µ̄ Γ_5 ⊗ τ_3 - ε̄ τ_1 ,    (2)

where I_2, τ_1 and τ_3 act in flavor space. Here the coarse operator is constructed with the same prolongation operator P of D(±µ), used diagonally in flavor space, i.e. P_ND = P ⊗ I_2. Thus, the coarse non-degenerate TM operator is defined as D_{ND,c}(µ̄, ε̄) = P_ND^† D_ND(µ̄, ε̄) P_ND.

In Fig. 1 we report the comparison of the time to solution between the CG and DD-αAMG solvers when the squared operator D^†(µ)D(µ) is inverted at different values of µ. These inversions are computed with the DD-αAMG solver in two steps, solving first D(-µ) x = Γ_5 b and then D(µ) y = Γ_5 x. Thus, the computational cost for solving D^†(µ)D(µ) y = b is twice the cost of inverting D(±µ). At the physical value of the light quark mass the DD-αAMG solver gives a speed-up of more than two orders of magnitude compared to the CG solver. When it is used for the non-degenerate TM operator at the physical strange quark mass the speed-up is around one order of magnitude.
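To make the structure of Eq. (1) concrete, the following Python sketch applies the two-grid cycle, a coarse-grid correction built from an aggregation-based prolongation followed by k smoothing steps, to a small toy matrix. The matrix, the block-Jacobi smoother standing in for SAP, and the random test vectors are illustrative stand-ins only; this is not DDalphaAMG code.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy stand-in for the Dirac operator: a small, well-conditioned non-Hermitian
# complex matrix.  Sizes are tiny so the example runs instantly.
n, n_agg, n_v = 48, 4, 3          # fine-grid size, number of aggregates, test vectors
D = np.eye(n) + 0.1 * (rng.standard_normal((n, n))
                       + 1j * rng.standard_normal((n, n))) / np.sqrt(n)

def build_prolongation(test_vectors, n_agg):
    """Aggregation-based P: orthonormalise the test vectors within each aggregate
    (the 'local coherence' construction, here fed with random test vectors)."""
    n, n_v = test_vectors.shape
    block = n // n_agg
    P = np.zeros((n, n_agg * n_v), dtype=complex)
    for a in range(n_agg):
        rows = slice(a * block, (a + 1) * block)
        q, _ = np.linalg.qr(test_vectors[rows, :])
        P[rows, a * n_v:(a + 1) * n_v] = q
    return P

P = build_prolongation(rng.standard_normal((n, n_v))
                       + 1j * rng.standard_normal((n, n_v)), n_agg)
Dc = P.conj().T @ D @ P                                  # coarse operator D_c = P^† D P

def smooth(D, x, b):
    """Cheap block-Jacobi smoother standing in for SAP."""
    block = D.shape[0] // n_agg
    r = b - D @ x
    for a in range(n_agg):
        s = slice(a * block, (a + 1) * block)
        x[s] += np.linalg.solve(D[s, s], r[s])
    return x

def two_grid_cycle(D, Dc, P, b, x, k=2):
    """One preconditioner application: coarse-grid correction, then k smoothing steps,
    so the error contracts as (1 - MD)^k (1 - P Dc^-1 P^† D)."""
    r = b - D @ x
    x = x + P @ np.linalg.solve(Dc, P.conj().T @ r)      # coarse-grid correction
    for _ in range(k):
        x = smooth(D, x, b)                              # post-smoothing
    return x

x_true = rng.standard_normal(n) + 1j * rng.standard_normal(n)
b, x = D @ x_true, np.zeros(n, dtype=complex)
for it in range(5):
    x = two_grid_cycle(D, Dc, P, b, x)
    print(f"cycle {it + 1}: relative error "
          f"{np.linalg.norm(x - x_true) / np.linalg.norm(x_true):.2e}")
```

In the actual solver the test vectors are produced by the adaptive setup phase rather than drawn at random, and the cycle is used to precondition a flexible Krylov method instead of being iterated stand-alone as done here.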

Characteristics of the simulations
The simulations are produced using the tmLQCD software package [10] and the DDalphaAMG library for TM fermions [6]. All simulation codes are released under the GNU license. The integrator is a nested minimal norm scheme of order 2 [11], with a nested integration setup similar to previously produced simulations [12]. We use a three-level DD-αAMG method for the small mass terms, a CG solver for the larger mass terms and a combination of a multi-mass-shift CG solver and the three-level DD-αAMG method to compute the rational approximation. The simulations are performed with the even-odd reduced operator. For the heat-bath inversions and the acceptance steps we require as stopping criterion that the relative residual of the solvers be smaller than 10^-11. For the force terms in the MD trajectory the criterion is relaxed to 10^-7 for the CG solver and 10^-9 for the DD-αAMG method. This ensures that the reversibility violation of the MD integration is sufficiently small. Note that the multigrid method is efficient only if the subspace can be reused over larger integration times, which in general leads to a larger reversibility violation. By choosing a higher accuracy, i.e. a smaller stopping criterion, for the multigrid solver, the reversibility violation can be reduced. We have checked that with the values quoted above the reversibility violations are compatible with the case where a CG solver is used for all the monomials.
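As an illustration of the nested minimal norm scheme of order 2, the following Python sketch integrates a toy Hamiltonian with two "monomials" placed on two nested time-scales. The potential, the per-level step numbers and the recursion are schematic stand-ins and not tmLQCD code; only the minimal-norm coefficient λ ≈ 0.19 is the standard second-order value.

```python
import numpy as np

# Toy Hamiltonian H = p^2/2 + S_0(q) + S_1(q): S_0 sits on the outer time-scale,
# S_1 on the inner one (the roles the different monomials play in the simulation).
S = [lambda q: 0.5 * q**2,
     lambda q: 2.0 * np.cos(q)]
dS = [lambda q: q,
      lambda q: -2.0 * np.sin(q)]

LAMBDA = 0.1931833275037836            # minimal-norm coefficient of the 2MN scheme

def mn2_step(q, p, dt, level, n_int):
    """One minimal-norm second-order step at `level`; the 'drift' of an outer level
    is a full sub-trajectory of the next-inner level (nested scheme)."""
    def drift(q, p, dt):
        if level + 1 < len(dS):                          # recurse into the finer scale
            for _ in range(n_int[level + 1]):
                q, p = mn2_step(q, p, dt / n_int[level + 1], level + 1, n_int)
            return q, p
        return q + dt * p, p                             # innermost level: free update

    p -= LAMBDA * dt * dS[level](q)
    q, p = drift(q, p, dt / 2)
    p -= (1 - 2 * LAMBDA) * dt * dS[level](q)
    q, p = drift(q, p, dt / 2)
    p -= LAMBDA * dt * dS[level](q)
    return q, p

def trajectory(q, p, tau, n_int):
    for _ in range(n_int[0]):
        q, p = mn2_step(q, p, tau / n_int[0], 0, n_int)
    return q, p

# Energy violation dH over one unit-length trajectory with nested steps {4, 12}.
q0, p0 = 0.7, np.random.default_rng(1).standard_normal()
H = lambda q, p: 0.5 * p**2 + sum(s(q) for s in S)
q1, p1 = trajectory(q0, p0, tau=1.0, n_int=[4, 12])
print("dH =", H(q1, p1) - H(q0, p0))
```

In the simulation each monomial (Hasenbusch ratio, rational-approximation factor) sits on its own time-scale in the same nested fashion, with the most expensive and smoothest forces on the outermost levels.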

N_f = 2 multigrid accelerated HMC simulations
In the N_f = 2 simulations we employ the Hasenbusch mass preconditioning [13] by introducing additional mass terms and splitting the determinant into ratios as follows:

    det(Q^2 + µ^2) = det(Q^2 + (µ + ρ_N)^2) ∏_{i=0}^{N-1} det[ (Q^2 + (µ + ρ_i)^2) / (Q^2 + (µ + ρ_{i+1})^2) ] ,  with ρ_0 = 0 ,    (3)

where Q = Γ_5 D is the hermitian Wilson Dirac operator and ρ_1 < ρ_2 < … < ρ_N are the Hasenbusch shifts. In our N_f = 2 simulation, given in Table 1, the resulting monomials are integrated on nested time-scales ordered from the largest shift to the smallest. As depicted in Fig. 2, this yields a stable simulation without large spikes in the energy violation δH, with an acceptance rate of 84.5%. For the Hasenbusch mass preconditioning we use the shifts µ + ρ_i depicted in Fig. 1. The DD-αAMG method is faster than the CG solver for all the shifts except the largest one, given by ρ_3. Thus we use DD-αAMG for the inversions with shifts µ, µ + ρ_1 and µ + ρ_2. The DD-αAMG iteration count, averaged per MD trajectory, is depicted in Fig. 2. No exceptional fluctuations or correlations with larger energy violations δH are seen along the simulation. The stability of the multigrid method is ensured by updating the setup every time the inversion at the physical quark mass, i.e. (Q^2 + µ^2)^-1, is performed. The update starts from the previous setup and consists of a single setup iteration, which is possible thanks to the adaptive definition of the setup iteration in the DD-αAMG method [14]. At the beginning of each trajectory we perform three setup iterations. The final speed-up, including setup costs, is a factor of 8 compared to CG in the N_f = 2 simulation at the physical pion mass.
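A quick numerical illustration of the determinant splitting in Eq. (3): for a small hermitian stand-in for Q and a few example shifts (not the values of the actual run), the product of ratios telescopes back to the light-quark determinant up to rounding.

```python
import numpy as np

rng = np.random.default_rng(3)

# Small hermitian stand-in for Q = Gamma_5 D; mu and the shifts are example values.
n = 40
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Q = (A + A.conj().T) / (2 * np.sqrt(n))
mu = 0.01
rho = [0.03, 0.1, 0.4]                        # rho_1 < rho_2 < rho_3

def logdet_shift(s):
    """log det(Q^2 + s^2), evaluated directly."""
    _, logdet = np.linalg.slogdet(Q @ Q + s**2 * np.eye(n))
    return logdet

# Eq. (3): det(Q^2 + mu^2) equals the heaviest determinant times the product of
# ratios of neighbouring shifts (a telescoping identity).
shifts = [mu] + [mu + r for r in rho]         # mu, mu + rho_1, ..., mu + rho_N
lhs = logdet_shift(mu)
rhs = logdet_shift(shifts[-1])
for s_light, s_heavy in zip(shifts[:-1], shifts[1:]):
    rhs += logdet_shift(s_light) - logdet_shift(s_heavy)
print("log-determinant difference:", abs(lhs - rhs))
```

The identity itself is exact; its value in the HMC lies in the fact that each ratio is much better conditioned than the light-quark determinant and can therefore be integrated with larger steps and cheaper solves.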

N_f = 2 + 1 + 1 multigrid accelerated (R)HMC simulations
In the N_f = 2 + 1 + 1 simulations we follow the same prescription reported in the previous section. In addition to the determinants on the rhs of Eq. (3), the determinant of the non-degenerate N_f = 1 + 1 TM operator D_ND in Eq. (2) has to be simulated, which can be achieved for the non-hermitian operator D_ND by using the Rational HMC (RHMC) [15].
Here the determinant is rewritten as

    det D_ND = det|Q_ND| = det(R_ND^-1) det(|Q_ND| R_ND) ,    (4)

where Q_ND = (Γ_5 ⊗ τ_1) D_ND = Q_ND^† is the hermitian version of D_ND and |Q_ND| ≡ (Q_ND^2)^{1/2}. The term R_ND is the optimal rational approximation of 1/√(Q_ND^2),

    R_ND ≡ R_{n,ε}(Q_ND^2) = a_{n,ε} ∏_{i=1}^{n} (Q_ND^2 + c_{n,ε,2i}) / (Q_ND^2 + c_{n,ε,2i-1}) ,    (5)

where n is the order of the approximation and ε = λ_min/λ_max is fixed by the smallest and largest eigenvalues of Q_ND^2, λ_min and λ_max, respectively. The coefficients a_{n,ε} and c_{n,ε,i} are given by the Zolotarev solution of the optimal rational approximation problem [16]. The rational approximation is optimal in the sense that the largest deviation from the approximated function is minimal, according to de la Vallée Poussin's theorem. In general, one can relax the approximation of the square root by introducing the correction term det(|Q_ND| R_ND) on the rhs of Eq. (4). It accounts for the deviation of |Q_ND| R_ND from the identity, which is as small as the rational approximation is precise. For this reason we include this term only in the acceptance step.
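The following Python sketch shows how a product-of-ratios approximation of the form of Eq. (5) is applied to a vector: every denominator factor costs one shifted linear solve, which is where the multi-mass-shift CG or the multigrid solver enters. The operator and the shift values are illustrative stand-ins, not the Zolotarev coefficients of the actual run.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hermitian toy stand-in for Q_ND; the shifts below are illustrative only,
# NOT the Zolotarev coefficients c_{n,eps,i} of the simulation.
n = 60
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Q = (A + A.conj().T) / (2 * np.sqrt(n))
Q2 = Q @ Q
a_norm = 1.0                                  # overall constant a_{n,eps}
num_shifts = [0.30, 0.06, 0.012]              # numerator shifts
den_shifts = [0.90, 0.15, 0.030]              # denominator shifts (one solve each)

def apply_R(b):
    """Apply R(Q^2) = a * prod_i (Q^2 + c_num_i)(Q^2 + c_den_i)^(-1) to a vector."""
    x = b.copy()
    for c_num, c_den in zip(num_shifts, den_shifts):
        x = np.linalg.solve(Q2 + c_den * np.eye(n), x)   # one shifted inversion
        x = Q2 @ x + c_num * x                           # numerator factor
    return a_norm * x

# Consistency check against the explicitly assembled matrix R.
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)
R = a_norm * np.eye(n)
for c_num, c_den in zip(num_shifts, den_shifts):
    R = (Q2 + c_num * np.eye(n)) @ np.linalg.inv(Q2 + c_den * np.eye(n)) @ R
print("product-form application deviates by",
      np.linalg.norm(apply_R(b) - R @ b) / np.linalg.norm(b))
```

Splitting such a product into several monomials, each carrying a few of the ratios, is what allows the individual factors to be placed on different time-scales of the integrator, as described below.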
In our N_f = 2 + 1 + 1 simulation with the properties given in Table 1, we use a rational approximation of order n = 10, which has a relative deviation such that ‖1 - |Q_ND| R_ND‖_∞ < 1.4 · 10^-6, considering the eigenvalues of Q_ND^2 in the interval [λ_min, λ_max] with λ_min = 6.5 · 10^-5 and λ_max = 4.7. The product of ratios on the rhs of Eq. (5) is split into four monomials R_1-4: the first two contain three shifts each, the last two contain two shifts each. We use a six-level nested minimal norm second-order integration scheme with integration steps N_int = {12, 36, 108, 324, 972, 2916}. The four monomials R_1-4 are placed one by one on the four outermost time-scales. As depicted in Fig. 3, this yields relatively small energy violations during the MD integration and an acceptance rate of 76.8%. In the same figure we depict the iteration count for the inversions done with DD-αAMG. The setup update is done on the second time-scale, for the shift µ + ρ_1, since it is still close to the light quark mass. In this case, we find slightly larger fluctuations in the iteration count of the outer level of the multigrid solver compared to the N_f = 2 simulation shown in Fig. 2. Overall the simulation is stable and we find no correlation of the iteration count with larger energy violations δH.

[Figure caption: "... Table 1. The N_f = 2 + 1 + 1 timings also include the calculation of the smallest and largest eigenvalue, done at least every ten trajectories. The large fluctuation around MDU 500 for the N_f = 2 + 1 + 1 ensemble is due to a bad allocation on the machine (it lasts exactly the duration of the allocation)."]

The solutions computed by DD-αAMG for the shifted linear systems in the monomials R_3 and R_4 are accelerated by providing an initial guess. Considering a Taylor expansion in the shift difference, we obtain

    (Q_ND^2 + c_{i+1})^-1 b = (Q_ND^2 + c_i)^-1 b - (c_{i+1} - c_i) (Q_ND^2 + c_i)^-2 b + O((c_{i+1} - c_i)^2) .

Thus we can use the solutions already computed at the previous shift, i.e. the first and second inverse powers of the shifted operator applied to the source, to construct an appropriate initial guess for the next shift. This saves up to 30% of the inversion time.
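A minimal Python sketch of this initial-guess construction, using a plain CG solver and a generic symmetric positive-definite matrix standing in for Q_ND^2; the matrix, the shifts and the tolerance are assumptions for illustration only, not the values of the run.

```python
import numpy as np

rng = np.random.default_rng(11)

# Symmetric positive-definite stand-in for Q_ND^2 and two neighbouring shifts.
n = 200
A = rng.standard_normal((n, n))
M = A @ A.T / n + 0.05 * np.eye(n)          # plays the role of Q_ND^2
c_prev, c_next = 0.10, 0.12                 # illustrative shifts
b = rng.standard_normal(n)

def cg(A, b, x0, tol=1e-10):
    """Plain conjugate gradient; returns the solution and the iteration count."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    for it in range(1, 10_000):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol * np.linalg.norm(b):   # relative residual criterion
            return x, it
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x, it

# Solutions at the previous shift: (M + c_prev)^-1 b and (M + c_prev)^-2 b.
x_prev, _ = cg(M + c_prev * np.eye(n), b, np.zeros(n))
y_prev, _ = cg(M + c_prev * np.eye(n), x_prev, np.zeros(n))

# First-order Taylor expansion in the shift difference as initial guess.
guess = x_prev - (c_next - c_prev) * y_prev

_, it_zero = cg(M + c_next * np.eye(n), b, np.zeros(n))
_, it_guess = cg(M + c_next * np.eye(n), b, guess)
print(f"CG iterations from zero guess: {it_zero}, from Taylor guess: {it_guess}")
```

The saving simply reflects the smaller initial error: the closer two consecutive shifts are, the more accurate the first-order expansion and the fewer iterations the solver needs.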
The computational cost of each monomial per MD trajectory is depicted in Fig. 4. The standard approach, shown in the left panel, employs the multi-mass-shift CG (MMS-CG) to invert all the shifted linear systems at once. In the right panel we depict the cost of the simulation accelerated by using the DD-αAMG solver for the most ill-conditioned linear systems. We achieve a speed-up for the N_f = 1 + 1 sector of a factor of 2 compared to a full eo-MMS-CG algorithm. The overall speed-up for the N_f = 2 + 1 + 1 simulation at the physical light, strange and charm quark masses is a factor of 5.

Conclusions
We have discussed multigrid accelerated simulations with TM fermions, showing a speed-up for simulations at the physical value of the pion mass of a factor of 8 in the case of N_f = 2 and of a factor of 5 in the case of N_f = 2 + 1 + 1, compared to simulations performed without multigrid. We use a three-level DD-αAMG method optimized for TM fermions [3]. No instabilities in the iteration count are seen during the whole simulations. As depicted in Fig. 5, the time per MDU is quite stable, with fluctuations within 10%. During the N_f = 2 + 1 + 1 simulation we calculate the smallest and largest eigenvalue of the non-degenerate TM operator at least every ten MDUs, which takes an additional 15 minutes. This explains the single points that frequently lie outside the main distribution. The two longer fluctuations are instead due to machine instabilities, since they are limited to a single allocation of the job. Although the multigrid method significantly limits the parallelization of the calculation, the speed-up achieved makes it feasible to sample enough MD trajectories. Indeed, the average time per MDU in the N_f = 2 simulation is slightly more than an hour and in the N_f = 2 + 1 + 1 simulation below three hours. In both cases, 4096 cores on 147 Haswell nodes of SuperMUC are used. We find that the three-level DD-αAMG method shows almost ideal scaling up to this number of cores for a lattice with volume V = 64^3 × 128.