Lattice Study of QCD Properties at the “Govorun” Supercomputer

This report is devoted to Lattice QCD simulations carried out at the "Govorun" supercomputer. The basics of the Lattice QCD methodology, the main Lattice QCD algorithms and the most numerically demanding routines are reviewed. We then present the details of our multiGPU code implementation and the specifics of its application on the "Govorun" architecture. We show that a very efficient and scalable implementation is possible and present scalability tests. Finally, we relate our program to the goals of the NICA project and review the main physical scenarios that are to be studied numerically.


Introduction
Quantum Chromodynamics (QCD) is the part of the Standard Model describing such elementary particles as quarks and gluons. The main feature of QCD is the confinement/deconfinement phase transition [1]. In short, at high energies (high temperature T or high particle density n) quarks and gluons are "good" degrees of freedom, and quark-gluon matter can be considered as consisting of independent quarks and gluons mildly interacting with each other. At low energies, however, the quarks and gluons form hadrons (protons, neutrons, etc.), which become the relevant degrees of freedom instead of the independent quarks and gluons. This significant change in the relevant degrees of freedom is of paramount theoretical importance and is being studied by many researchers.
Even though QCD is believed to be the correct theory of the strong sector of the Standard Model, the QCD Lagrangian is very complicated and notoriously strongly coupled. The coupling constant $\alpha_s$ describing the interaction strength between quarks and gluons is large ($\alpha_s \sim 1$) in many physically relevant scenarios. This forbids an expansion in powers of $\alpha_s$ and thus an analytical derivation of experimentally observable predictions from QCD first principles. However, lattice simulations of QCD enable the study of many QCD properties using modern supercomputers [2].
Currently, several heavy-ion collision experiments such as NICA, RHIC, LHC and FAIR are aimed at the experimental study of the QCD phase diagram in the temperature-baryon density $(T, \mu)$ plane. As already noted, the only reliable first-principles approach to the study of QCD properties in the $(T, \mu)$ plane is provided by lattice simulations of QCD (LQCD). The method is based on the Feynman path integral technique and is widely known to be reliable: many experimental observables have been confirmed by direct LQCD computations. Thanks to LQCD we know a lot about the thermodynamic properties of QCD at zero or small baryon density [3,4].
Although the LQCD method is reliable in terms of controllable approximations, it is numerically very demanding and requires novel algorithms and hardware. In recent years it was realized that GPUs (Graphics Processing Units) can be used to significantly speed up LQCD simulations, and one of the latest trends in LQCD is the application of multiGPU techniques. In this paper we review the multiGPU algorithms that we have implemented. We discuss the basic speed-up techniques and the multiGPU specifics. We also discuss the prospects of the "Govorun" supercomputer for multiGPU simulations and perform a scalability analysis of our code. Lastly, we discuss the main projects that require the multiGPU code and their relation to the NICA collider.

Lattice action and discretization
The QCD Lagrangian is well known and gives accurate analytical predictions where possible. However, in many cases direct QCD calculations from first principles are only possible within LQCD. In LQCD [5] one starts by introducing discrete coordinates $x_\mu(n_\nu)$, $n_\nu \in \{0, 1, 2, \ldots\}$, called the lattice discretization. The fields are now defined only at a finite number of lattice sites. Usually the lattice has $L_s \sim 64$ sites in every spatial direction and $L_t \sim 64$ sites in the temporal direction. The introduced lattice has a lattice spacing $a$, and in the continuum limit $a \to 0$ it should reproduce the experimentally observable predictions. The discretized LQCD action reads
\[
S = \beta \sum_{n \in \Lambda} \sum_{\mu < \nu} \left( 1 - \frac{1}{3}\, \mathrm{Re}\, \mathrm{Tr}\, U_{\mu\nu}(n) \right) + \sum_f \bar q_f \left( D(U) + m_f \right) q_f,
\]
where $U_{\mu\nu}(n)$ is the plaquette, $D(U)$ is the Dirac operator and $\Lambda$ is the whole set of $L_s^3 \times L_t$ lattice sites. In the lattice version of the QCD Lagrangian, the gluonic fields $A_\mu(x)$ in the SU(3) algebra become SU(3) group elements $U_\mu(x) = \exp\left( i a A_\mu(x) \right)$, while the $q(x)$ are 3-component complex columns. In the continuum limit the terms above reproduce the well-known expressions
\[
S \xrightarrow{a \to 0} \int d^4x \left( \frac{1}{4} F^a_{\mu\nu} F^a_{\mu\nu} + \sum_f \bar q_f \left( \gamma_\mu D_\mu + m_f \right) q_f \right).
\]
Lattice QCD studies the system in thermal equilibrium. The main object of interest is the partition function, which contains valuable information about the system properties:
\[
Z = \int \mathcal{D}U\, \mathcal{D}\bar q\, \mathcal{D}q\; e^{-S(U, \bar q, q)}.
\]
The partition function $Z$ contains an integration over a continuum set of variables, which turns into a finite set of $L_s^3 \times L_t$ variables within the lattice discretization. The Grassmannian anticommuting variables $q(x)$ can be integrated out analytically to give
\[
Z = \int \mathcal{D}U\; e^{-S_G(U)} \prod_f \det\left( D(U) + m_f \right),
\]
where the determinant is taken of an operator of size $(L_s^3 \times L_t) \times (L_s^3 \times L_t)$ and $f$ denotes the quark flavors present. Although this computation is infeasible on today's computers (we shall discuss this later), we can now review the assumptions made, the approximations and the whole computation pipeline. First, one computes the partition function $Z(\mu, T, a, m_f, L_s, L_t)$. In order to reproduce the physical result, one should perform the continuum extrapolation $a \to 0$ and the infinite-volume extrapolation $L_s, L_t \to \infty$. The uncertainties of the computation can always be reduced to the required level by performing longer simulations. Thus, the method keeps all the uncertainties and approximations under control.
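To get a feeling for the scale of the problem, one can count the gauge degrees of freedom for a typical lattice (a rough estimate of ours; the pseudofermion fields introduced below add a comparable amount):
\[
\underbrace{L_s^3 \times L_t}_{\text{sites}} \times \underbrace{4}_{\text{links per site}} \times \underbrace{8}_{\text{real parameters per SU(3) link}} \;\Big|_{L_s = 64,\; L_t = 16} \;\approx\; 1.3 \times 10^8,
\]
which matches the $\sim 10^8$ integration variables quoted in the next section.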
The next step of the methodology is getting rid of the fermionic determinant $\det\left( D_f(U) + m_f \right)$ to allow a numerical study. This is done by applying the well-known Gaussian integration formula
\[
\det\left( D_f(U) + m_f \right) \propto \int \mathcal{D}\bar\varphi\, \mathcal{D}\varphi\; \exp\left( -\bar\varphi \left( D_f(U) + m_f \right)^{-1} \varphi \right),
\]
where the exponent is literally the action of the inverse matrix on the vector $\varphi$ followed by the scalar product with $\bar\varphi$. The partition function now reads
\[
Z = \int \mathcal{D}U\, \mathcal{D}\bar\varphi\, \mathcal{D}\varphi\; \exp\left( -S_G(U) - \sum_f \bar\varphi_f \left( D_f(U) + m_f \right)^{-1} \varphi_f \right).
\]
Although we can now simulate this partition function on the lattice, one has to perform numerous inversions of the very large and sometimes badly conditioned $D(U) + m$ operator. This is done by applying the conjugate gradient method.
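To make the inversion step concrete, below is a minimal single-GPU conjugate gradient sketch using cuBLAS for the linear algebra. It assumes a hermitian positive-definite operator (production codes typically solve the normal equations), flattens the fields to real device arrays of length n, and relies on a hypothetical apply_dirac routine standing in for the $D(U) + m$ kernel:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical user-supplied routine: y = (D(U) + m) x, both device arrays.
void apply_dirac(const double* x, double* y, int n);

// Minimal CG sketch for (D(U) + m) x = b; real production codes add
// preconditioning, mixed precision and stopping-criterion safeguards.
void cg_solve(const double* b, double* x, int n, int max_iter, double tol) {
    cublasHandle_t h;  cublasCreate(&h);
    double *r, *p, *Ap;
    cudaMalloc(&r,  n * sizeof(double));
    cudaMalloc(&p,  n * sizeof(double));
    cudaMalloc(&Ap, n * sizeof(double));

    // Start from x = 0, so the residual r = b and the direction p = r.
    cudaMemset(x, 0, n * sizeof(double));
    cudaMemcpy(r, b, n * sizeof(double), cudaMemcpyDeviceToDevice);
    cudaMemcpy(p, r, n * sizeof(double), cudaMemcpyDeviceToDevice);

    double rr;  cublasDdot(h, n, r, 1, r, 1, &rr);
    for (int k = 0; k < max_iter && rr > tol * tol; ++k) {
        apply_dirac(p, Ap, n);                        // Ap = (D + m) p
        double pAp;  cublasDdot(h, n, p, 1, Ap, 1, &pAp);
        double alpha = rr / pAp, neg_alpha = -alpha;
        cublasDaxpy(h, n, &alpha, p, 1, x, 1);        // x += alpha * p
        cublasDaxpy(h, n, &neg_alpha, Ap, 1, r, 1);   // r -= alpha * Ap
        double rr_new;  cublasDdot(h, n, r, 1, r, 1, &rr_new);
        double beta = rr_new / rr;  rr = rr_new;
        cublasDscal(h, n, &beta, p, 1);               // p = r + beta * p
        double one = 1.0;
        cublasDaxpy(h, n, &one, r, 1, p, 1);
    }
    cudaFree(r); cudaFree(p); cudaFree(Ap); cublasDestroy(h);
}
```

The structure already shows where the multiGPU effort will go: the two scalar products per iteration become global reductions, while apply_dirac needs neighbor data at the subvolume boundaries.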

The Hybrid Monte Carlo (HMC) algorithm
The final expression for the partition function in a typical contemporary lattice simulation contains $\sim 10^8$ integration variables. The integral is estimated stochastically within Monte Carlo sampling [5]; however, naive Monte Carlo sampling fails because the $\exp(-S(U, \varphi))$ distribution is very localized, and field configurations away from the action minimum (the maximum of $\exp(-S(U, \varphi))$) are significantly suppressed. In this situation the clever integration technique of [6] is employed. Let us perform the Legendre transform and consider the Hamiltonian
\[
H(\pi, U, \varphi) = \sum_x \frac{\pi^2(x)}{2} + S(U, \varphi),
\]
where the $\pi(x)$ are the momenta conjugate to the $U(x)$ fields. At every step of the algorithm one generates the momenta from a normal distribution, $\pi(x) \sim N(0, 1)$. Then the evolution in the fictitious molecular time $\tau$ is performed in accordance with the Hamilton equations of motion
\[
\dot U = \frac{\partial H}{\partial \pi}, \qquad \dot \pi = -\frac{\partial H}{\partial U}.
\]
After several steps of the evolution, the final field configuration $(U', \varphi', \pi')$ is accepted as the next Monte Carlo sample with the Metropolis probability $P(H \to H') = \min\left( 1, \exp(H - H') \right)$. After a sufficiently large number of iterations the field configurations $(U, \varphi)$ will be distributed in accordance with the required weight $W(U, \varphi) = \exp\left( -S(U, \varphi) \right)$. The algorithm can be pictured as a Brownian motion in the high-dimensional configuration space. Although at this point the LQCD algorithm is much more feasible, there are still many pending problems. First of all, the Hamiltonian dynamics itself is described by a very complicated partial differential equation in 4-dimensional space-time. The variable manipulations involve complicated algebra such as multiplication of complex SU(3) fields and matrix exponentiation $\exp\left( U_\mu(x) \right)$. However, the main numerical effort resides in the computation of the force $\partial H(x, \tau)/\partial U(x, \tau)$, which contains the inverse Dirac operator acting on the pseudofermion field, $\left( D(U) + m \right)^{-1} \varphi(x)$.
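As an illustration of the structure of the algorithm, here is a toy single-variable HMC trajectory with a leapfrog integrator (a sketch of ours; in LQCD the variable u becomes the set of SU(3) links plus pseudofermions, and the force evaluation is the expensive part):

```cuda
#include <cmath>
#include <random>

// Toy single-variable HMC: samples u with the weight exp(-S(u)).
double S(double u)     { return 0.5 * u * u; }   // toy action
double force(double u) { return -u; }            // -dS/du

double hmc_trajectory(double u, int n_steps, double dtau, std::mt19937& rng) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    double pi = gauss(rng);                      // refresh the momentum
    double H_old = 0.5 * pi * pi + S(u);

    double u_new = u;                            // leapfrog integration
    pi += 0.5 * dtau * force(u_new);
    for (int k = 0; k < n_steps; ++k) {
        u_new += dtau * pi;
        if (k < n_steps - 1) pi += dtau * force(u_new);
    }
    pi += 0.5 * dtau * force(u_new);

    double H_new = 0.5 * pi * pi + S(u_new);
    // The Metropolis accept/reject step corrects the integration error.
    return (unif(rng) < std::exp(H_old - H_new)) ? u_new : u;
}
```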

Omelyan integrator and scale separation
The general idea behind the HMC algorithm is to evolve the field configuration far enough from the previous configuration to consider them independent (uncorrelated) Monte Carlo samples. Since the computation of the force at every integration step is very demanding, one needs to reduce the number of force computations for the same length of evolution in the molecular time $\tau$. The Omelyan integrator [6] splits each integration step $\Delta\tau$ into several pieces within which the momenta ($\pi$) and the fields ($U$) are updated alternately:
\[
e^{\Delta\tau \hat H} \approx e^{\lambda \Delta\tau \hat\pi}\; e^{\frac{\Delta\tau}{2} \hat U}\; e^{(1 - 2\lambda) \Delta\tau \hat\pi}\; e^{\frac{\Delta\tau}{2} \hat U}\; e^{\lambda \Delta\tau \hat\pi},
\]
where $\hat\pi$ denotes a momentum update (a force evaluation), $\hat U$ a field update, and $\lambda \approx 0.193$. The method reaches an $O(\Delta\tau^3)$ integration error instead of the standard $O(\Delta\tau^2)$. This allows one to perform much longer integration trajectories with fewer force evaluations.
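In code, one such step reads as follows (a toy single-variable sketch of ours, reusing the force notation from the HMC example above):

```cuda
// One Omelyan step for a single variable; three force evaluations per step,
// with the O(dtau^2) error coefficient minimized by lambda ~ 0.193.
double force(double u);   // -dS/du, as in the HMC sketch above

void omelyan_step(double& u, double& pi, double dtau) {
    const double lambda = 0.193;
    pi += lambda * dtau * force(u);
    u  += 0.5 * dtau * pi;
    pi += (1.0 - 2.0 * lambda) * dtau * force(u);
    u  += 0.5 * dtau * pi;
    pi += lambda * dtau * force(u);
}
```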
Another powerful speed-up idea is that the fermionic part of the force, $-\partial S_F/\partial U$, varies much more slowly than the gluonic part, $-\partial S_G/\partial U$. It is therefore possible to keep the fermionic force term constant over several reevaluations of the gluonic force. This allows one to reduce the number of $\left( D(U) + m \right)^{-1} \varphi(x)$ evaluations by a factor of $\sim 3$.
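Schematically, this scale separation nests a cheap inner evolution inside each expensive outer step (a toy leapfrog-in-leapfrog sketch of ours; production codes nest Omelyan steps and evolve SU(3) fields):

```cuda
// Toy sketch of the multiple-time-scale idea on a single variable: the
// expensive "fermionic" force F_slow is evaluated once per outer step,
// while the cheap "gluonic" force F_fast is evaluated n_inner times.
double F_slow(double u) { return -0.1 * u; }   // stands in for -dS_F/dU
double F_fast(double u) { return -u; }         // stands in for -dS_G/dU

void outer_step(double& u, double& pi, double dtau, int n_inner) {
    pi += 0.5 * dtau * F_slow(u);              // expensive force: once
    double dt = dtau / n_inner;
    for (int k = 0; k < n_inner; ++k) {        // cheap inner evolution
        pi += 0.5 * dt * F_fast(u);
        u  += dt * pi;
        pi += 0.5 * dt * F_fast(u);
    }
    pi += 0.5 * dtau * F_slow(u);
}
```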

GPU, linear algebra and memory alignment
The LQCD formulation operates essentially with large vectors and operators. The action is written as a sum of terms over lattice sites, so LQCD can be efficiently sped up on GPUs. Modern GPUs have $\sim 3\times10^3$ working threads and allow a drastic speed-up of LQCD simulations. The action of the $D(U) + m$ operator and standard linear algebra (LA) operations such as vector manipulations and scalar products can be efficiently parallelized. It is also worth noting the importance of using contemporary software. For instance, the "thrust" package provides efficient scalar product routines for the GPU; it employs a clever reduction algorithm that generates minimal overhead. Another example is the "cuBLAS" library, which efficiently implements the LA operations. It is also important to consider the coalescence issue. Even though the number of threads on the GPU is $\sim 3\times10^3$, many of them might be stalled because of memory read latency. In a typical LQCD application, each thread reads $\sim 400$ bytes of data from the GPU global memory. Such reads should be coalesced, so that subsequent threads read subsequent memory addresses (Figure 1). If the reads are coalesced, the data for several threads is obtained in a single read, which reduces the memory bandwidth load. When the reads were not coalesced, only $\sim 10\%$ of the threads were working simultaneously in our computations; coalescing allowed us to get rid of the stalled threads. Larger lattice volumes lead to a substantial increase of the simulation times. Moreover, large lattice volumes do not fit into the memory of a single GPU. To obviate these circumstances, multiGPU LQCD codes [7] have been developed.
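The layout implication can be illustrated with a pair of schematic kernels (our sketch, not the production code): storing a field as a structure of arrays makes neighboring threads read neighboring addresses.

```cuda
#include <cuda_runtime.h>

// Schematic illustration of coalescing for a complex 3-component field.
// Array-of-structures: thread i reads 6 consecutive doubles for itself,
// so neighboring threads access addresses 48 bytes apart (uncoalesced).
struct ColorVecAoS { double re[3], im[3]; };

__global__ void scale_aos(ColorVecAoS* q, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int c = 0; c < 3; ++c) { q[i].re[c] *= s; q[i].im[c] *= s; }
}

// Structure-of-arrays: component c of all sites is contiguous, so at each
// loop iteration threads i and i+1 read adjacent doubles (coalesced).
__global__ void scale_soa(double* re, double* im, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int c = 0; c < 3; ++c) { re[c * n + i] *= s; im[c * n + i] *= s; }
}
```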

MultiGPU applications and the "Govorun" supercomputer
As noted, the inversion of the $D(U) + m$ operator is done with the conjugate gradient (CG) method. Each iteration of CG requires 2 scalar products, 3 LA operations and one application of $D(U)$. The $D(U)$ operator is local (Figure 2): computing an element involves nearest neighbors only. Based on this structure of the Dirac operator, the following multiGPU strategy is used (Fig. 3). The lattice is cut along the z-direction (or any spatial dimension, since usually $L_s > L_t$) into $N$ equal pieces, where $N$ is the number of available GPUs, so $L_z = L_s/N$. In order to apply $D(U)$ at the edge of every GPU's subvolume, we need to know the edges of the left and right neighbors, because $D(U)$ connects a site with its nearest neighbors. In order to keep the code simpler, we extend $L_z \to L_z + 2$, so that the $z = 0$ and $z = L_z - 1$ slices contain the information about the neighbors [8]. These slices are called "buffers": they receive the data from the neighbors and store it in order to apply $D(U)$ at the edge. The extension of $L_z$ allows one to keep many functions in the code untouched.
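The buffer exchange itself can be sketched as follows (assuming CUDA-aware MPI so that device pointers can be passed directly, one GPU per rank, and a field flattened so that each z-slice occupies `slice` contiguous doubles; the function name is ours):

```cuda
#include <mpi.h>

// Sketch of the buffer exchange along z for a periodic lattice. Each rank
// sends its edge slices and receives the neighbors' edges into the two
// extra buffer slices at z = 0 and z = Lz - 1.
void exchange_buffers(double* field, int Lz, int slice, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    // Send our z = 1 slice to the left neighbor; receive the right
    // neighbor's z = 1 slice into our z = Lz - 1 buffer.
    MPI_Sendrecv(field + 1 * slice,        slice, MPI_DOUBLE, left,  0,
                 field + (Lz - 1) * slice, slice, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    // The mirror transfer: z = Lz - 2 goes right, the left neighbor's
    // edge arrives into our z = 0 buffer.
    MPI_Sendrecv(field + (Lz - 2) * slice, slice, MPI_DOUBLE, right, 1,
                 field + 0 * slice,        slice, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}
```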
Memory transfer is a costly operation; however, in some cases its latency can be completely hidden behind the computation. To perform this transfer-computation overlap, we call the $z = 1$ and $z = L_z - 2$ slices "halos", while the $1 < z < L_z - 2$ slices form the bulk (Fig. 3). The black (red online) sites represent the bulk of the GPUs; these sites can be processed without knowledge about the neighbors. The dark gray (orange online) sites are the halos: their processing requires the data transferred into the buffers.
Thus, one performs the memory transfer of the halos in parallel with the computation of the Dirac operator on the bulk. If the computations are sufficiently long, the memory transfer cost is completely hidden. We also would like to emphasize that contemporary GPUs are designed to perform computations and memory copies in parallel, so the computation speed is not reduced. Figure 4 shows the NVPROF profile of the $N = 2$ GPU code executed on the DGX-1 nodes of the "Govorun" supercomputer for a $64^3 \times 16$ lattice. The data transfers lie completely within the Dirac operator action on the bulk; thus, no additional latency is generated by the memory transfers. The code employs the strategy where every MPI process drives exactly one GPU, which is required in order to run the code on several nodes. Thus, in order to perform scalar products one has to apply the MPI_Allreduce operation; this is the main source of latency in the profile shown in Figure 4. We finally performed strong and weak scaling tests for the typical contemporary $64^3 \times 16$ lattice on the K80 "blade" nodes of "HybriLIT" and the DGX-1 nodes of the "Govorun" supercomputer. The results for the strong scaling are shown in Figure 5, while the results for the weak scaling do not deviate from perfect scaling, so we do not show them here. On both the "blade" and DGX-1 queues the scaling is almost perfect up to 4 GPUs. We then see a significant deviation at 8 GPUs for the "blade" queue; the reason is the slow InfiniBand internode transfer. This problem is unimportant on two DGX-1 nodes.
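The distributed scalar product mentioned above can be sketched in a few lines (again assuming one GPU per MPI rank and vectors flattened to real device arrays):

```cuda
#include <mpi.h>
#include <cublas_v2.h>

// Each rank computes the dot product of its local parts of the vectors on
// its own GPU; MPI_Allreduce then sums the partial results over all ranks.
double global_dot(cublasHandle_t h, const double* x, const double* y,
                  int n_local, MPI_Comm comm) {
    double local = 0.0, global = 0.0;
    cublasDdot(h, n_local, x, 1, y, 1, &local);   // local part on the GPU
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```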

Discussion and perspectives
We have described the main LQCD algorithms and workflow, together with the developed multiGPU LQCD code, which allows us to significantly speed up the computations and consider much larger lattice volumes. The tests performed on the "HybriLIT" platform show that fast CUDA-aware MPI, fast intranode communication on DGX-1 and a quite good internode connection on the "blade" (K80) nodes allow one to perform efficient and scalable computations with small overhead. The created code allows us to start real simulations connected to the future NICA experiments. The projects include the study of sign-problem-free QCD-like theories such as two-color QCD at finite baryon density [9,10], SU(3) QCD with isospin density [11] and with imaginary baryon chemical potential. We will also study the transport properties of quark-gluon matter such as the viscosities [12,13] and the electromagnetic conductivity.