Lattice QCD GPU Inverters on ROCm Platform

The open-source ROCm/HIP platform for GPU computing provides a uniform framework that supports both NVIDIA and AMD GPUs, as well as the possibility of porting CUDA code to HIP-compatible code. We present the porting progress of the overlap fermion inverter (GWU-code) and of the general Lattice QCD inverter package QUDA. A manual for using QUDA on HIP and tips for porting general CUDA code into the HIP framework are also provided.


Introduction
Engineering and energy-efficiency constraints push modern supercomputer architectures toward multi-level parallelism, and heterogeneous architectures such as CPU+GPU are widely used among the Top500 supercomputers [1], including the most recent fastest machines Summit and Sierra. Most of their performance comes from NVIDIA V100 GPUs, so efficient codes are essential for scientific applications such as Lattice Quantum Chromodynamics (Lattice QCD) to benefit from those machines.
QCD is the established theory describing the strong interaction, and Lattice QCD [2,3] is its discretized version formulated on a Euclidean space-time grid (as shown in Fig. 1) instead of Minkowski space-time. Lattice QCD calculations require massive computational resources and increasingly rely on GPU acceleration.
The CUDA platform developed by NVIDIA is widely adopted for accelerating computation in many fields, including Lattice QCD. There are already quite a few packages that support CUDA with good performance and multi-GPU scaling, including QUDA (for most fermion actions) [4][5][6], GRID (for the domain wall fermion, etc.) [7], GWU-code (for the overlap and clover fermions) [8,9], and so on.
On the other hand, few efficient codes support AMD GPUs apart from some efforts with OpenCL (e.g. CL2QCD [10]), even though the peak performance of AMD GPUs has caught up and the exascale supercomputer "Frontier" with AMD GPUs will be built in the US by 2021. AMD has therefore promoted the open-source programming platform ROCm (Radeon Open Compute) in recent years to improve the programming experience on its own CPUs/GPUs. The ROCm [11] platform is "the first open-source HPC/Hyperscale-class platform for GPU computing"; it is still young and under active development. It supports several languages such as HIP [12], OpenCL and Python, and supports both AMD's and NVIDIA's GPUs, as shown in Fig. 2. One can write a single code base using ROCm/HIP to support both AMD's and NVIDIA's hardware, which is convenient, and one can port CUDA code to ROCm/HIP with the help of official porting tools like hipify-perl [13]. Based on this, writing code from scratch and re-implementing the hundreds of features needed by Lattice QCD calculations can be avoided by porting the existing CUDA codes. In this proceeding, we present our findings on using the packages GWU-code and QUDA on AMD GPUs through the ROCm/HIP platform, with a summary of the known issues.

Porting Prerequisites
Before we started porting QUDA and GWU-code, we analyzed the external library dependencies. The QUDA library is officially supported by NVIDIA and depends on CUDA and several third-party CUDA libraries such as cuFFT, cuRAND and CUB. Fortunately, there is a HIP counterpart for each of these libraries; Table 1 shows the correspondence between CUDA and ROCm/HIP. Note that the Eigen library only partially supports ROCm/HIP, and we had to provide some of the missing pieces ourselves.
Table 1. CUDA modules used in QUDA and GWU-code and the corresponding modules in ROCm; for more details see Ref. [14].
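As a concrete illustration of this correspondence, the minimal sketch below contrasts a few typical CUDA runtime and cuFFT calls with their HIP counterparts; the FFT example is purely illustrative and is not taken from QUDA or GWU-code.

// Minimal sketch (not from QUDA/GWU-code) showing how hipify-perl maps
// common CUDA calls to their HIP counterparts. The original CUDA names
// are noted in the comments.
#include <hip/hip_runtime.h>   // was: #include <cuda_runtime.h>
#include <hipfft.h>            // was: #include <cufft.h>

int main() {
  const int n = 1024;
  double2 *buf = nullptr;                                   // complex buffer
  hipMalloc(&buf, n * sizeof(double2));                     // was: cudaMalloc
  hipMemset(buf, 0, n * sizeof(double2));                   // was: cudaMemset

  hipfftHandle plan;                                        // was: cufftHandle
  hipfftPlan1d(&plan, n, HIPFFT_Z2Z, 1);                    // was: cufftPlan1d(..., CUFFT_Z2Z, ...)
  hipfftExecZ2Z(plan, (hipfftDoubleComplex *)buf,           // was: cufftExecZ2Z
                (hipfftDoubleComplex *)buf, HIPFFT_FORWARD);

  hipDeviceSynchronize();                                   // was: cudaDeviceSynchronize
  hipfftDestroy(plan);                                      // was: cufftDestroy
  hipFree(buf);                                             // was: cudaFree
  return 0;
}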

Porting Progress
In this section, we discuss the porting progress and the problems we encountered while porting QUDA and GWU-code. Note that since ROCm/HIP is a young and actively developed project, some of these problems may be solved by upgrading ROCm/HIP, while new ones may arise.

Porting tips and compilation manual for QUDA
Generally, the porting can be separated into three stages: convert the code with hipify-perl, patch the code manually to satisfy the requirements of the compiler, and replace the unsupported features to avoid runtime crashes. Let us take the porting of QUDA as an example:
1. Convert the code with hipify-perl. hipify-perl is a Perl script that maps the names of the CUDA functions, as well as the CUDA header files, to their HIP counterparts. If the script meets an unknown word starting with "cu", the message "warning:... : unsupported device function" is emitted, while the conversion continues.
2. Patch the code manually to satisfy the requirements of the compiler. Currently, QUDA is compiled with hip-clang instead of HCC (Heterogeneous Compute Compiler). HC is a C++ AMP-style language with HSA extensions [15]. HCC translates the HIP kernel syntax into C++ AMP syntax using functional or macro grid launches, and certain QUDA device functions that use complicated classes and templates cause syntax or runtime errors there. On the other hand, hip-clang is an LLVM frontend that supports the HIP kernel syntax directly, as shown in Fig. 3. By setting the environment variable HIP_PLATFORM to clang, hip-clang takes over the compilation and compiles the CUDA-like source code to LLVM IR directly [16]; the AMD GPU backend of LLVM then compiles the LLVM IR to binary. The patches we applied include:
(a) Set CMAKE_CXX_SYSROOT_FLAG_CODE to add the .cu suffix to CMAKE_CXX_SOURCE_FILE_EXTENSIONS, so that hip-clang compiles both the .cpp and the .cu files. Note that the flag "-g" should be avoided, and "-D__STRICT_ANSI__ -O3" is necessary to make the code work well. Compiling clover_deriv_quda.cu and gauge_stout.cu crashes the compiler; adding the flag "-fno-inline" is one way to avoid this.
(b) Passing a large kernel-argument class by value can make the performance extremely low. Such a problem can be avoided by copying the argument class to GPU memory first and using its reference as the kernel argument, as sketched below.
Other minor changes can be found in the corresponding branch on GitHub: https://github.com/lattice/quda/tree/rocm-devel.
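The following is a minimal sketch of this workaround, not QUDA's actual implementation; the argument struct Arg, the kernel axpyKernel and the wrapper launch are hypothetical names used only for illustration.

// Minimal sketch of the "copy the argument class to GPU memory" workaround.
// Arg, axpyKernel and launch are hypothetical; QUDA's real argument classes
// are much larger, which is what makes pass-by-value slow.
#include <hip/hip_runtime.h>

struct Arg {
  double a;
  double *x;
  double *y;
  int n;
};

// Instead of taking "Arg arg" by value, the kernel receives a pointer to a
// copy of the argument class that already resides in GPU memory.
__global__ void axpyKernel(const Arg *arg) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < arg->n) arg->y[i] = arg->a * arg->x[i] + arg->y[i];
}

void launch(const Arg &arg) {
  Arg *arg_d = nullptr;
  hipMalloc(&arg_d, sizeof(Arg));                            // device copy of the argument class
  hipMemcpy(arg_d, &arg, sizeof(Arg), hipMemcpyHostToDevice);
  const int block = 256, grid = (arg.n + block - 1) / block;
  hipLaunchKernelGGL(axpyKernel, dim3(grid), dim3(block), 0, 0, arg_d);
  hipDeviceSynchronize();
  hipFree(arg_d);
}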
In terms of the QUDA building options, the porting progress is listed in Tab. 2.

GWU-Code Porting
Compared to QUDA, porting GWU-code is much simpler. GWU-code uses macros to generate the D-slash GPU kernels without any device functions, and implements the vector operations on the GPU with CUDA Thrust [5,6]. Thus, one only needs to replace the Thrust library with rocThrust after the code has been converted with hipify-perl. The function thrust::reduce can be very slow in double precision in certain versions, but it can be replaced by other functions with normal performance.
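The sketch below illustrates this point; it is not GWU-code's actual kernel. sum_thrust shows the thrust::reduce call that compiles unchanged against rocThrust, and sum_hipcub shows one possible replacement through hipCUB (the proceedings do not specify which replacement was actually used).

// Minimal sketch (not GWU-code's implementation) of a double-precision
// reduction that builds against either CUDA Thrust or rocThrust, plus one
// possible hipCUB-based replacement.
#include <hip/hip_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <hipcub/hipcub.hpp>

double sum_thrust(const thrust::device_vector<double> &v) {
  // Identical source for Thrust and rocThrust; this is the kind of call that
  // was observed to be slow in double precision with certain ROCm versions.
  return thrust::reduce(v.begin(), v.end(), 0.0);
}

double sum_hipcub(const double *d_in, int n) {
  // Device-wide reduction through hipCUB as an alternative.
  double *d_out = nullptr;
  hipMalloc(&d_out, sizeof(double));
  void *d_temp = nullptr;
  size_t temp_bytes = 0;
  hipcub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n); // query temp storage size
  hipMalloc(&d_temp, temp_bytes);
  hipcub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n); // perform the reduction
  double result = 0.0;
  hipMemcpy(&result, d_out, sizeof(double), hipMemcpyDeviceToHost);
  hipFree(d_temp);
  hipFree(d_out);
  return result;
}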

Performance Test
Our tests are based on the AMD MI60 GPU and the NVIDIA V100 GPU. The D-slash performance is summarized in Tab. 3. The QUDA D-slash performance is somewhat lower, as this kernel is much more complicated and thus cannot be fully optimized with hip-clang at present.
As one practical application of QUDA, the multigrid inverter [17], a useful and efficient lattice inverter, works correctly although its performance is not yet satisfactory. For the largest lattice we tested, 96³ × 192, the total performance with 324 MI60 GPUs is around 10 TFlops, using a 3-level multigrid with blocking sizes (4,4,4,4) and (2,2,2,2), which is not as good as we expected. The performance of the multigrid inverter on ROCm is much lower than that on the CUDA platform, but could be optimized further.
On the GWU-code side, 200 pairs of the overlap eigensystem on the HYP-smeared 24³ × 64 RBC configurations at lattice spacing a = 0.11 fm can be generated with 4 MI60 GPUs within 4 hours, while a similar calculation on E5-2698v3 CPUs at 2.3 GHz requires 1024 cores for up to 2 hours. The result is acceptable, and both the strong and the weak scaling are reasonable, but the performance is again much lower than on the CUDA platform. We still need more effort to improve the performance. Tests with larger lattice sizes are in progress.

Summary
In summary, we have ported two Lattice QCD CUDA packages, QUDA and GWU-code, to the AMD GPU platform ROCm using HIP. The performance is around half of that on the CUDA platform with similar memory bandwidth. The multigrid inverter of QUDA works correctly on lattices as large as 96³ × 192, and the overlap eigensystem can be calculated correctly with GWU-code. We will try to optimize the performance and scaling further in future studies. All the present tests using HIP were performed on AMD GPUs; NVIDIA GPUs will be investigated as well.
This work is supported in part by the National Key Research and Development Program of China (No. 2017YFB0203200) and the Strategic Priority Research Program of the Chinese Academy of Sciences.