Numerical experiments using deflation with the HISQ action

We report on numerical experiments using deflation to compute quark propagators for the highly improved staggered quark (HISQ) action. The method is tested on HISQ gauge configurations, generated by the MILC collaboration, with a lattice spacing of 0.15 fm, a range of volumes, and sea quark masses down to the physical quark mass.


Introduction
An important goal of lattice QCD flavour physics calculations is to find deviations from the predictions of the standard model of particle physics. Exploiting configurations with physical pion masses requires speeding up the calculation of quark propagators and improved measurement techniques to reduce statistical errors. For example, the poor signal-to-noise ratio of the rho correlator at large times complicates the analysis of the hadronic vacuum polarization contribution to the anomalous magnetic moment of the muon [1]. The eigenvalues and eigenvectors of the fermion matrix can be useful both for speeding up the calculation of quark propagators and for constructing correlators with reduced errors. The RBC collaboration has implemented techniques that require thousands of eigenmodes to be calculated [2,3].
There has been much reported progress in developing faster algorithms for the computation of quark propagators for both Wilson-like fermions and domain wall (and overlap) fermions. The time taken to compute quark propagators has been reduced by factors of up to O(15) [4][5][6][7], using algorithms such as multigrid or domain decomposition applied to lattice QCD.
Another way to speed up an inversion is to remove the lowest eigenvalues and eigenvectors from the matrix. This is known as exact deflation. Deflation has been used to speed up the inversion of sparse matrices [8,9] in applied mathematics. Wilcox [10][11][12] has reviewed the various deflation algorithms [13] used in lattice-QCD calculations developed before 2007. The xQCD collaboration [14] has reported speed-ups of 20 to 80, using deflation in combination with other techniques, for the calculation of quark propagators with the overlap formalism on configurations with domain wall sea quarks.
Lüscher [15,16] noted that the determination of the eigenvalues for deflation has a potential cost that grows as O(V^2) for a volume V. In principle, inversion algorithms using multigrid or domain decomposition should have a performance that is independent of the volume. In practice, the parameter dependence of the algorithms has to be tested.

Speaker, e-mail: craig.mcneile@plymouth.ac.uk. arXiv:1710.07219v1 [hep-lat], 19 Oct 2017.
There are close connections between matrix-inversion algorithms and the algorithms used to compute eigenvalues [17]. For example, Stathopoulos and Orginos have developed an algorithm, called EIG-CG, which combines sparse matrix inversion with the determination of the eigenvalues [18]. The original algorithm worked for Hermitian systems. They tested it with anisotropic unquenched Wilson fermions. For single sources, they found a speed-up of between 7 and 10 for the ensembles they tested. See [19] for a similar algorithm.
There has been much less reported progress in the inversion algorithms for staggered fermions. The staggered fermion operator is antihermitian, and thus the inversion algorithm generally used is conjugate gradient on the normal equations, exploiting even-odd preconditioning. The Wilson-like operators are non-normal, and additional algorithms were developed to work with this type of matrix. The calculation of the inverse operator for overlap and domain wall fermions uses a nested procedure. The performance of deflation techniques for the inversion of staggered operators could thus be very different from those for domain wall, overlap, or Wilson theories. The main development in speeding up the inversion algorithms for staggered fermions was the use of multimass inverters [20]. After 20 years [21], work has restarted on using multigrid algorithms for staggered fermions [22,23], but algorithms are not yet available for QCD.
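As a concrete illustration of this inversion strategy, the following toy numpy sketch (a random antihermitian matrix standing in for the staggered operator, with arbitrary size, mass, and seed; not our production code) solves (m + D)x = b by applying conjugate gradient to the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
# Random antihermitian D as a stand-in for the (massless) staggered operator.
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
D = (A - A.conj().T) / 2
m = 0.2
M = m * np.eye(n) + D          # fermion matrix m + D
b = rng.standard_normal(n).astype(complex)

# Normal equations: M^dag M = m^2 - D^2 is Hermitian positive definite,
# so plain conjugate gradient applies.
MdM = M.conj().T @ M
rhs = M.conj().T @ b

x = np.zeros(n, dtype=complex)
r = rhs.copy()
p = r.copy()
rs = np.vdot(r, r).real
for _ in range(2000):
    Ap = MdM @ p
    alpha = rs / np.vdot(p, Ap).real
    x += alpha * p
    r -= alpha * Ap
    rs_new = np.vdot(r, r).real
    if np.sqrt(rs_new) < 1e-8:
        break
    p = r + (rs_new / rs) * p
    rs = rs_new

# The solution of the normal equations solves the original system.
assert np.linalg.norm(M @ x - b) < 1e-6
```

The even-odd preconditioning used in practice applies the same idea to the even sublattice only, where the operator becomes m^2 + D_eo D_eo^†.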
The determination of a large number of eigenmodes can be extremely costly. One way to amortize this startup cost is to reuse the eigenvectors many times, for example by inverting the fermion operator for multiple right-hand sides [12]. They can be used to compute the low-mode contribution to correlators from all-to-all propagators [24][25][26][27][28][29], treating only the high modes with stochastic corrections. Difficult-to-estimate disconnected loops can also be computed using eigenvectors and eigenvalues. Note that there are also newer techniques for reducing the noise [30] in lattice QCD calculations, which are not based on eigenmodes.

Eigenvalues of the improved staggered Dirac operator
Staggered Dirac operators, including the HISQ [31] operator, obey

{D, ε} = 0 ,    ε(x) = (−1)^{x_1 + x_2 + x_3 + x_4} .

The massless staggered Dirac operator is antihermitian; hence, the eigenvalue spectrum is purely imaginary. If f_s is an eigenvector with eigenvalue iλ_s, then εf_s is also an eigenvector, with eigenvalue −iλ_s. There are potential zero modes λ_s = 0, related to the topology of the gauge fields in the continuum. The lattice approximation moves these eigenvalues away from 0. The fermion matrix for the HISQ action used in the inverter is

M = ( m     D_eo )
    ( D_oe  m    ) ,    D_oe = −D_eo^† ,

where D_eo is the part of the fermion operator which connects the even and odd sublattices and m is the quark mass.

The following combination was used in the eigensolver,

D_eo D_eo^† χ^j_e = λ_j^2 χ^j_e .

Unlike matrix inverters, the eigensolvers do not require that the matrix be positive definite. The odd part of the j-th eigenvector, χ^j_o, can be reconstructed from the even part χ^j_e,

χ^j_o = (1 / iλ_j) D_oe χ^j_e .
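The even-odd algebra described above can be checked with a small numpy sketch (a random complex matrix stands in for D_eo; the size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8  # sites per sublattice (toy size)

# Random complex matrix standing in for the even-to-odd hopping term D_eo.
D_eo = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
D_oe = -D_eo.conj().T

# The massless operator D = [[0, D_eo], [D_oe, 0]] is antihermitian.
D = np.block([[np.zeros((n, n)), D_eo],
              [D_oe, np.zeros((n, n))]])
assert np.allclose(D, -D.conj().T)

# The eigensolver works with the even-even combination D_eo D_eo^dag.
lam2, chi_e = np.linalg.eigh(D_eo @ D_eo.conj().T)
j = 3
lam = np.sqrt(lam2[j])

# Reconstruct the odd part: chi_o = D_oe chi_e / (i lam).
chi_o = D_oe @ chi_e[:, j] / (1j * lam)
chi = np.concatenate([chi_e[:, j], chi_o])

# The reconstructed full vector is an eigenvector of D with eigenvalue i*lam.
assert np.allclose(D @ chi, 1j * lam * chi)
```

This confirms that solving the Hermitian even-even problem yields eigenvectors of the full antihermitian operator.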

Eigensolvers in lattice QCD
A key issue in the use of deflation is the performance of the algorithm used to determine the eigenvalues and eigenvectors. The Lanczos algorithm has been used from the very early days of lattice QCD to determine eigenvalues. A three-term recursion relation is used to find a tridiagonal matrix. Unfortunately, the rounding errors generate ghost eigenvalues. There are many improvements of the basic Lanczos algorithm, involving various types of restarts and vector spaces, which are implemented in various software libraries or directly in lattice QCD codes. For example, the MILC code contains routines which compute eigenvalues using an accelerated conjugate gradient algorithm [32], as well as an implementation of the EIG-CG algorithm [18].
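A minimal sketch of the plain Lanczos three-term recursion (on a toy diagonal test matrix with a known spectrum; no reorthogonalisation, so the ghost eigenvalues mentioned above would appear at larger iteration counts):

```python
import numpy as np

def lanczos(A, v0, k):
    """Plain Lanczos: the three-term recursion builds a k x k tridiagonal
    matrix T whose eigenvalues (Ritz values) approximate the extremal
    eigenvalues of the symmetric matrix A.  There is no reorthogonalisation,
    so in finite precision duplicate 'ghost' copies of converged eigenvalues
    eventually appear."""
    n = len(v0)
    V = np.zeros((n, k + 1))
    alpha = np.zeros(k)
    beta = np.zeros(k)
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(k):
        w = A @ V[:, j]
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        beta[j] = np.linalg.norm(w)
        if beta[j] == 0:
            break
        V[:, j + 1] = w / beta[j]
    return np.diag(alpha) + np.diag(beta[:k - 1], 1) + np.diag(beta[:k - 1], -1)

# Toy test matrix with known spectrum 1, 2, ..., 200.
rng = np.random.default_rng(2)
A = np.diag(np.arange(1.0, 201.0))
T = lanczos(A, rng.standard_normal(200), 60)
ritz = np.linalg.eigvalsh(T)
# The extremal Ritz values converge first: ritz[0] ~ 1, ritz[-1] ~ 200.
```

The restarted variants used in production libraries keep the memory footprint bounded by compressing the basis V rather than storing all k + 1 vectors.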
There are a number of external sparse eigensolver libraries commonly used in lattice QCD calculations. For example, the MILC code can call the PRIMME (PReconditioned Iterative MultiMethod Eigensolver) library [33], which uses the Jacobi-Davidson method. For this project we added an interface to the ARPACK library [34], which uses the implicitly restarted Arnoldi method (IRAM). A comparison [35] of a variant of the Lanczos algorithm with the accelerated conjugate gradient algorithm [32] found that Lanczos was better. Most of our work on this project has used either the PRIMME or ARPACK libraries to determine the eigenvalues.
The number of iterations required in solving linear equations using conjugate gradient is related to the condition number of the matrix. This condition number is crucial to understanding the performance of the conjugate gradient algorithm, and indeed the idea of exact deflation is based on reducing the condition number by removing the lowest eigenvectors.
The condition number of a problem also expresses how sensitive the output is to changes in the input. There are condition numbers for the determination of each eigenvalue and eigenvector. Here we report some results (reviewed in [36]) for the condition numbers of eigenvalues and eigenvectors of Hermitian matrices. Let x̃_j and λ̃_j be estimates of the j-th eigenvector and eigenvalue of the matrix M; we use λ_j and x_j for the true eigenvalues and eigenvectors. A residual can be computed,

r_j = || M x̃_j − λ̃_j x̃_j ||_2    (7)

with the convention || x̃_j ||_2 = 1. The error in the determination of a given eigenvalue is then bounded,

min_i | λ_i − λ̃_j | ≤ r_j .

There is another relationship based on how close together the eigenvalues are, which may produce a tighter bound on the computed eigenvalues. The computed eigenvectors of a Hermitian matrix are much more sensitive to the input conditions than the eigenvalues.
The computed eigenvector can be written as x̃_j = x_j cos θ + z sin θ, where z is a unit vector orthogonal to x_j; the angle θ between the computed and true eigenvectors is bounded by

sin θ ≤ r_j / δ_j ,

where δ_j is the separation between λ̃_j and the nearest other eigenvalue of M. This is the basis of the heuristic statement that it is cheaper to compute the eigenvalue spectrum if the eigenvalues are widely separated.
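The eigenvalue residual bound can be verified numerically. The sketch below (a random Hermitian test matrix of arbitrary size) perturbs an exact eigenvector and checks that some exact eigenvalue lies within r_j of the Rayleigh-quotient estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
H = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
M = (H + H.conj().T) / 2          # Hermitian test matrix
exact, vecs = np.linalg.eigh(M)

# Build an approximate eigenpair by perturbing an exact eigenvector.
x = vecs[:, 10] + 1e-3 * rng.standard_normal(n)
x /= np.linalg.norm(x)
lam_est = np.vdot(x, M @ x).real  # Rayleigh-quotient estimate of the eigenvalue

# Residual of the approximate eigenpair (eq. 7).
r = np.linalg.norm(M @ x - lam_est * x)

# For a Hermitian matrix, some exact eigenvalue lies within r of lam_est.
assert np.min(np.abs(exact - lam_est)) <= r
```

No such universal bound holds for the eigenvector angle: its sensitivity depends on the spectral gap δ_j, which is why eigenvectors are harder to compute accurately than eigenvalues.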
To speed up the computation of the eigenvalues we are experimenting with polynomial acceleration [2,24,38]. A polynomial p(M) of the matrix is used in the eigensolver. The idea is to suppress the unwanted part of the eigenvalue spectrum and to spread out the required part of the spectrum. Chebyshev polynomials are a popular choice: they are bounded by 1 in magnitude between −1 and 1, but grow rapidly outside this region. In the simplest scheme, the unwanted eigenvalues are mapped to lie between −1 and 1; see the Chebyshev polynomials in figure 1. Sorensen and Yang [39] investigated using polynomial approximations to step functions in the eigensolver, but found that the use of Chebyshev polynomials gave superior results to the other polynomials they tested.
We have tested a polynomial proposed and investigated by Zhou and Saad [37]. Consider a matrix M with eigenvalues in the interval [a_0, b], where the wanted eigenvalues lie in [a_0, a]. Define e = (b−a)/2 and c = (b+a)/2. The iterative scheme for the polynomials is the standard Chebyshev three-term recursion,

C_{j+1}(t) = 2t C_j(t) − C_{j−1}(t) ,

applied to the shifted and scaled matrix (M − cI)/e. Zhou and Saad [37] introduced a modification to this iterative scheme: dividing by the scaling factor ρ_j = C_j((a_0 − c)/e), so that the eigenvalues in the range [a_0, a] are mapped close to 1, and the eigenvalues in the range [a, b] are mapped close to 0. In figure 2, we plot polynomials of different orders suggested by Zhou and Saad [37] for target eigenvalues in the range 0.0 to 0.01.
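A toy implementation of the scaled Chebyshev filter (a diagonal test matrix with illustrative interval boundaries; the scaling factors follow the Zhou-Saad recursion, and parameter names are our own):

```python
import numpy as np

def cheb_filter(M, v, order, a, b, a0):
    """Apply the scaled Chebyshev filter polynomial p(M) to a vector v.
    Eigenvalue components of M in the unwanted interval [a, b] are damped,
    while those in the wanted interval [a0, a] are kept close to 1."""
    e = (b - a) / 2.0        # half-width of the unwanted interval
    c = (b + a) / 2.0        # centre of the unwanted interval
    sigma1 = e / (a0 - c)    # initial scaling factor
    sigma = sigma1
    y = (M @ v - c * v) * (sigma1 / e)   # degree-1 filtered vector
    v_prev = v
    for _ in range(2, order + 1):
        sigma_new = 1.0 / (2.0 / sigma1 - sigma)
        y_new = 2.0 * (sigma_new / e) * (M @ y - c * y) - (sigma * sigma_new) * v_prev
        v_prev, y, sigma = y, y_new, sigma_new
    return y

# Toy test: diagonal matrix, wanted eigenvalues in [0, 0.01], unwanted in [0.01, 1].
eigs = np.linspace(0.0, 1.0, 200)
M = np.diag(eigs)
v = np.ones(200) / np.sqrt(200.0)        # equal weight on every mode
y = cheb_filter(M, v, order=30, a=0.01, b=1.0, a0=0.0)
y /= np.linalg.norm(y)
# After filtering, the vector is dominated by the wanted low modes.
low_weight = np.sum(y[eigs <= 0.01] ** 2)
```

Feeding such filtered vectors to the eigensolver concentrates its effort on the wanted part of the spectrum, which is the source of the speed-up discussed below.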

Results from numerical experiments
The goal is to solve the linear equations Mx = b for x, given b and the fermion matrix M. The conjugate gradient algorithm monotonically reduces the error of the iterates x_i at every iteration, driving the residual r_i = b − Mx_i to zero, and hence finds x.
Figure 3 shows the square of the residual of the CG inverter as a function of the number of iterations as the number of deflated low eigenmodes is varied. As expected, deflation reduces the number of iterations required to reach a given residual. Progress stops when the residual reaches a value comparable to the accuracy of the eigenmodes. Figure 4 shows the residual as a function of iteration when 100 eigenmodes are deflated. The eigenmodes are determined with varying accuracy using the PRIMME eigensolver.
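The effect of exact deflation on CG can be illustrated with a small numpy sketch (a synthetic positive-definite matrix with a few artificially small eigenvalues, standing in for the poorly conditioned fermion matrix; not the HISQ operator):

```python
import numpy as np

def cg(A, b, tol=1e-8, maxiter=2000):
    """Plain conjugate gradient for a Hermitian positive-definite A.
    Returns the solution and the iteration count."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = np.vdot(r, r).real
    for it in range(maxiter):
        Ap = A @ p
        alpha = rs / np.vdot(p, Ap).real
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.vdot(r, r).real
        if np.sqrt(rs_new) < tol:
            return x, it + 1
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, maxiter

rng = np.random.default_rng(4)
n = 400
# Positive-definite test matrix with 10 very small eigenvalues (poor conditioning).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
spectrum = np.concatenate([np.geomspace(1e-4, 1e-3, 10),
                           np.linspace(0.5, 2.0, n - 10)])
A = Q @ np.diag(spectrum) @ Q.T
b = rng.standard_normal(n)

x_plain, it_plain = cg(A, b)

# Exact deflation of the 10 lowest modes.
k = 10
w, v = np.linalg.eigh(A)
V = v[:, :k]
b_defl = b - V @ (V.T @ b)          # project b onto the high-mode subspace
x_high, it_defl = cg(A, b_defl)     # CG now sees a mild condition number
x_low = V @ ((V.T @ b) / w[:k])     # low-mode contribution solved exactly
x = x_high + x_low

assert it_defl < it_plain           # deflation cuts the iteration count
assert np.linalg.norm(A @ x - b) < 1e-6
```

As in figure 3, the deflated solve converges in far fewer iterations, because CG only has to resolve the well-conditioned high-mode part of the spectrum.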
An important issue is the performance of the eigensolver, because it determines the setup cost for exact deflation. The timings for the ARPACK eigensolvers are plotted in figure 5. We are still tuning the Jacobi-Davidson algorithm in the PRIMME library. For example, the paper [39] uses polynomial acceleration with a Davidson algorithm for an application outside lattice QCD. We did not optimize the polynomials used in the acceleration procedure. We stopped tuning the parameters of the algorithm once we obtained residuals around 10^−14 with polynomial acceleration, so a further reduction of the time can probably be made.
Figure 5 shows the preliminary time for determining different numbers of eigenmodes with different algorithms. The tests were run on 64 nodes (Intel Sandy-Bridge cores). The figure shows the dramatic speed-up of the eigensolver when polynomial acceleration is used. We are still tuning the various polynomials used and the performance of the algorithm in the ARPACK library. The figure also shows the increase in the time required to compute the eigenvalues as the volume is increased from 16^3 × 48 to 32^3 × 48. The performance of the eigensolver depends on the separation between the eigenvalues, so it is useful to study the first O(1000) eigenvalues. The majority (apart from [42]) of the studies of the eigenvalues of lattice Dirac operators have focused on the smallest eigenvalues, because these are related to topology.
In figure 6 we plot the eigenvalues of D_eo (computed from the square root of the eigenvalues of D_eo D_eo^†) on the three volumes. The magnitude of the eigenvalues is approximately linear in the mode number. A simple scaling of the bulk eigenvalues with 1/V for the space-time volume V does not map the eigenvalues onto a universal curve, so more study of the volume dependence is required. A log scale on the y-axis would help to reveal the potential zero modes. Follana et al. [43] argued that the near-zero topological eigenmodes scale differently with volume than the bulk eigenmodes.
There are many places in lattice QCD calculations where the computation of thousands of eigenmodes is required, either in speeding up the calculation of propagators, or in the design of better measurement techniques. Our experience has been that a lot of tuning is required to get even reasonable performance from an eigensolver. It would be useful to have a better understanding [17] of how to construct improved polynomials to use with an eigensolver.

Figure 1. Two Chebyshev polynomials in the range [a_0, b]. The polynomial is designed to suppress the eigenvalues in the range [a, b], where a > a_0, with e = (b−a)/2.

Figure 2. Polynomials of different orders suggested by Zhou and Saad [37], for target eigenvalues in the range 0.0 to 0.01.

Figure 3. Square of the residual as a function of iteration with a number of eigenvalues projected out.

Figure 4. Square of the residual as a function of iteration with 100 eigenvalues projected out. The eigenvalues are determined with different accuracies.