Optimization of the Brillouin operator on the KNL architecture

Experiences with optimizing the matrix-times-vector application of the Brillouin operator on the Intel KNL processor are reported. Without adjustments to the memory layout, performance figures of 360 Gflop/s in single and 270 Gflop/s in double precision are observed. This is with N_c=3 colors, N_v=12 right-hand-sides, N_{thr}=256 threads, on lattices of size 32^3*64, using exclusively OMP pragmas. Interestingly, the same routine performs quite well on Intel Core i7 architectures, too. Some observations on the much harder Wilson fermion matrix-times-vector optimization problem are added.


Introduction
Conceptually the Brillouin operator $D_{\rm B}$ is a sibling of the Wilson Dirac operator $D_{\rm W}$, since
$$D_{\rm W}(x,y) = \sum_\mu \gamma_\mu \nabla^{\rm std}_\mu(x,y) - \frac{a}{2}\,\triangle^{\rm std}(x,y) + m_0\,\delta_{x,y} - \frac{c_{\rm SW}}{2} \sum_{\mu<\nu} \sigma_{\mu\nu} F_{\mu\nu}\,\delta_{x,y} \qquad (1)$$
$$D_{\rm B}(x,y) = \sum_\mu \gamma_\mu \nabla^{\rm iso}_\mu(x,y) - \frac{a}{2}\,\triangle^{\rm bri}(x,y) + m_0\,\delta_{x,y} - \frac{c_{\rm SW}}{2} \sum_{\mu<\nu} \sigma_{\mu\nu} F_{\mu\nu}\,\delta_{x,y} \qquad (2)$$
share the same structure, with $\sigma_{\mu\nu} = \frac{i}{2}[\gamma_\mu,\gamma_\nu]$ and $F_{\mu\nu}$ the hermitean clover-leaf field-strength tensor. The main difference is that the isotropic derivative $\nabla^{\rm iso}_\mu$ and the Brillouin Laplacian $\triangle^{\rm bri}$ both include 80 neighbors (all within the $[-1,+1]^4$ hypercube around a given lattice point $x$, i.e. up to 4 hops away from $x$, but no two hops may be in the same direction), rather than the 8 neighbors present in the standard derivative $\nabla^{\rm std}_\mu$ and the standard Laplacian $\triangle^{\rm std}$. Given that the Brillouin operator has a larger "footprint" and hence more operations per site than the Wilson operator, a natural question to ask is whether $D_{\rm B}$ is better suited than $D_{\rm W}$ to modern architectures (which typically involve lots of cores, but are limited by memory bandwidth).
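The 80-neighbor footprint quoted above can be checked by brute force; a minimal sketch (my own, not part of the code suite) enumerates all offsets in the $[-1,+1]^4$ hypercube and groups them by hop count:

```python
from itertools import product
from collections import Counter

# all offsets in the [-1,+1]^4 hypercube around a site, minus the site itself
offsets = [d for d in product((-1, 0, 1), repeat=4) if any(d)]
print(len(offsets))  # 80 neighbors (3^4 = 81 points minus the center)

# since each component has magnitude at most 1, no two hops share a direction;
# group neighbors by how many hops (non-zero components) they are away
hops = Counter(sum(1 for c in d if c != 0) for d in offsets)
print(hops[1], hops[2], hops[3], hops[4])  # 8 24 32 16
```

The counts 8, 24, 32, 16 are exactly the term multiplicities of the stencils discussed in the next section.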
As a first step to address this question, I decided to come up with a simple implementation of the Brillouin operator on the Intel KNL architecture. More specifically, "simple" is meant to imply that (i) only shared-memory parallelization via OpenMP pragmas on a single CPU is used (i.e. no distributed-memory parallelization with MPI), (ii) the memory layout is unchanged from the generic layout used throughout my code suite (see below for details), and (iii) no single-thread performance tuning is attempted beyond adding straightforward SIMD hints (again via OpenMP pragmas). To the expert these constraints may look unnecessarily tight, but it turns out that nonetheless sustained performance figures of 360 Gflop/s in single-precision (sp) arithmetic can be achieved.

Brillouin operator in a nutshell
Let $(\lambda_0, \lambda_1, \lambda_2, \lambda_3, \lambda_4) \equiv (-240, 8, 4, 2, 1)/64$; then the free $\triangle^{\rm bri}$ in (2) takes the form
$$a^2 \triangle^{\rm bri}(x,y) = \lambda_0\,\delta_{x,y} + \lambda_1 \sum_{\mu} \delta_{x+\hat\mu,y} + \frac{\lambda_2}{2!} \sum_{(\nu,\mu)} \delta_{x+\hat\nu+\hat\mu,y} + \frac{\lambda_3}{3!} \sum_{(\rho,\nu,\mu)} \delta_{x+\hat\rho+\hat\nu+\hat\mu,y} + \frac{\lambda_4}{4!} \sum_{(\sigma,\rho,\nu,\mu)} \delta_{x+\hat\sigma+\hat\rho+\hat\nu+\hat\mu,y} \qquad (3)$$
where all indices run over $\pm1,\dots,\pm4$ and $(\rho,\nu,\mu)$ means that $\rho$, $\nu$ and $\mu$ are summed over, subject to the constraint that no two elements are equal in modulus. Similarly, with $(\rho_1, \rho_2, \rho_3, \rho_4) \equiv (64, 16, 4, 1)/432$ the free $\nabla^{\rm iso}_\mu$ in (2) takes the form
$$a \nabla^{\rm iso}_\mu(x,y) = \rho_1 \big[\delta_{x+\hat\mu,y} - \delta_{x-\hat\mu,y}\big] + \rho_2 \sum_{(\nu;\mu)} \big[\delta_{x+\hat\nu+\hat\mu,y} - \delta_{x+\hat\nu-\hat\mu,y}\big] + \frac{\rho_3}{2!} \sum_{(\rho,\nu;\mu)} \big[\delta_{x+\hat\rho+\hat\nu+\hat\mu,y} - \delta_{x+\hat\rho+\hat\nu-\hat\mu,y}\big] + \frac{\rho_4}{3!} \sum_{(\sigma,\rho,\nu;\mu)} \big[\delta_{x+\hat\sigma+\hat\rho+\hat\nu+\hat\mu,y} - \delta_{x+\hat\sigma+\hat\rho+\hat\nu-\hat\mu,y}\big] \qquad (4)$$
where $(\rho,\nu;\mu)$ means that only $\rho$ and $\nu$ are summed over, while still no two out of the three indices may be equal in modulus. In the interacting theory, these stencils are to be gauged in the obvious way. For instance, considering (3) we see 8 terms ($\propto \lambda_1$) with 1 hop; they are dressed with $U_\mu(x)$ or $U_\mu(x-\hat\mu)^\dagger$, for positive or negative $\mu$, respectively. Similarly, there are 24 terms with 2 hops; they are dressed with off-axis links of the form $\frac{1}{2}\big[U_\nu(x) U_\mu(x+\hat\nu) + U_\mu(x) U_\nu(x+\hat\mu)\big]$, i.e. the average of the 2 products of two factors each. Next, there are 32 terms with 3 hops; they are dressed with off-axis links which are the average of 6 products of three factors each. And finally, there are 16 terms with 4 hops; they are dressed with hyperdiagonal links built as the average of 24 products of four factors each. Further details of the operator can be found in [1,2].
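As a quick sanity check of the stencil weights (my own check, not from the text): against the hop multiplicities 1, 8, 24, 32, 16, the Laplacian coefficients must sum to zero (so that $\triangle^{\rm bri}$ annihilates constant fields), and the antisymmetric derivative coefficients, weighted by the number of terms containing a $+\hat\mu$ hop (1, 6, 12, 8), must normalize to a first derivative:

```python
from fractions import Fraction

lam = [Fraction(c, 64) for c in (-240, 8, 4, 2, 1)]   # (lambda_0 .. lambda_4)
mult = [1, 8, 24, 32, 16]                             # sites 0..4 hops away
# the Brillouin Laplacian must vanish on constant fields
assert sum(l * m for l, m in zip(lam, mult)) == 0

rho = [Fraction(c, 432) for c in (64, 16, 4, 1)]      # (rho_1 .. rho_4)
fwd = [1, 6, 12, 8]                                   # terms containing a +mu-hat hop
# antisymmetric stencil: leading Taylor coefficient must equal 1 (i.e. a d/dx_mu)
assert 2 * sum(r * m for r, m in zip(rho, fwd)) == 1
print("stencil normalizations check out")
```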

Code suite overall guidelines
The overall guidelines of the code suite are best illustrated by taking a look at the Wilson operator, which operates on a vector of length $N_c \cdot 4 \cdot N_x N_y N_z N_t$ per right-hand-side, where $N_c$ is the number of colors. From a computational viewpoint it is extremely convenient to declare the source and sink vectors as complex arrays of size (1:Nc,1:4,1:Nv,1:Nx*Ny*Nz*Nt). Here the ordering and the stride notation common to Matlab and Fortran are used; with the default counting from 1 we could write (Nc,4,Nv,Nx*Ny*Nz*Nt). A special feature is that we have one slot to address the right-hand-side (rhs), i.e. $N_v$ columns (in the mathematical setup terminology) can be processed simultaneously. The underlying philosophy is that index computations are done by the compiler, except for the site index n=(l-1)*Nx*Ny*Nz+(k-1)*Nx*Ny+(j-1)*Nx+i, where we want to keep some freedom.
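The site index formula can be exercised in isolation; a small sketch (the helper name `site_index` is mine, not from the code suite), assuming 1-based coordinates as in the Fortran:

```python
def site_index(i, j, k, l, Nx, Ny, Nz):
    # 1-based lexicographic site index with x fastest and t slowest,
    # mirroring n = (l-1)*Nx*Ny*Nz + (k-1)*Nx*Ny + (j-1)*Nx + i
    return (l - 1) * Nx * Ny * Nz + (k - 1) * Nx * Ny + (j - 1) * Nx + i

Nx, Ny, Nz, Nt = 32, 32, 32, 64
assert site_index(1, 1, 1, 1, Nx, Ny, Nz) == 1                       # first site
assert site_index(Nx, Ny, Nz, Nt, Nx, Ny, Nz) == Nx * Ny * Nz * Nt   # last site
# consecutive x-coordinates are adjacent in memory, i.e. x is the fastest index
assert site_index(2, 1, 1, 1, Nx, Ny, Nz) - site_index(1, 1, 1, 1, Nx, Ny, Nz) == 1
```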
The working of these guidelines, on the basis of the Wilson operator (5), is spelled out in the code assembled in Figs. 1 and 2. The arrangement of the loops over the $x,y,z,t$ coordinates (denoted i,j,k,l, respectively) makes sure the innermost loop belongs to the fastest index. A noteworthy mathematical detail is how the $4N_c \times 4N_c$ matrix $\frac{1}{2}(\gamma_\mu - I) \otimes U_\mu(x)$ acts on the $4N_c \times 1$ vector old(:,:,idx,n) with given rhs and site indices. This product is realized by reshaping the vector into a $N_c \times 4$ matrix (which it already is in our setup), multiplying it with $U_\mu(x)$ from the left, and with $\frac{1}{2}(\gamma_\mu - I)^{\rm trsp}$ from the right. The full-spinor variable full(Nc,4) contains the result of the former multiplication; the $\gamma$-operation just reorders its columns (modulo some signs and factors of i).
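The reshape trick can be sketched in NumPy terms (a paraphrase of the Fortran, not the actual routine; the diagonal matrix standing in for $\gamma_\mu$ is an arbitrary placeholder): with column-major ordering, applying the Kronecker product to the flattened spinor equals two small matrix products on the $N_c \times 4$ matrix:

```python
import numpy as np

Nc = 3
rng = np.random.default_rng(0)
U = rng.standard_normal((Nc, Nc)) + 1j * rng.standard_normal((Nc, Nc))  # stand-in for U_mu(x)
psi = rng.standard_normal((Nc, 4)) + 1j * rng.standard_normal((Nc, 4))  # spinor at one site/rhs

gamma = np.diag([1, 1, -1, -1]).astype(complex)  # placeholder gamma_mu; any 4x4 works here
P = 0.5 * (gamma - np.eye(4))

# Kronecker form acting on the column-major-flattened spinor ...
lhs = (np.kron(P, U) @ psi.reshape(4 * Nc, order="F")).reshape((Nc, 4), order="F")
# ... equals U from the left and P transposed from the right on the Nc x 4 matrix
rhs = U @ psi @ P.T
assert np.allclose(lhs, rhs)
```

This is just the vec identity $(A \otimes B)\,{\rm vec}(X) = {\rm vec}(B X A^{\rm trsp})$, which is why the Fortran code never forms the $4N_c \times 4N_c$ matrix explicitly.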

Brillouin kernel details
The Brillouin matrix-times-vector routine, built according to the guidelines laid out in the previous section, is portrayed in Figs. 3 and 4. The four-fold loop structure over the out-vector is OMP-parallelized in a straightforward way (the COLLAPSE(2) statement makes sure that up to $N_z N_t$ threads can be launched). The most important difference to the Wilson routine is that 40 out of the 81 directions of the off-axis links W, built from the smeared gauge field V, are assembled in the complex array W(Nc,Nc,40,Nx,Ny,Nz,Nt); the remaining ones are the identity or the hermitean conjugate of W at a hypercube-related point (i.e. up to four hops away).
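The bookkeeping behind storing only 40 of the 81 directions can be illustrated as follows (my own sketch): the link to a neighbor at offset $d$ is the hermitean conjugate of the link stored at the point-reflected offset $-d$, so only one member of each $\pm$ pair needs to be kept, plus the trivial identity at $d = 0$:

```python
from itertools import product

dirs = list(product((-1, 0, 1), repeat=4))   # all 81 hypercube offsets
origin = (0, 0, 0, 0)
# keep one representative of each {d, -d} pair, e.g. the lexicographically larger one
stored = [d for d in dirs if d != origin and d > tuple(-c for c in d)]
print(len(stored))  # 40 stored directions; 40 conjugates plus the identity cover the rest
```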

Brillouin operator timings
Timings are done on a node containing a single KNL chip with 64 cores. All results for the Brillouin operator are converted into Gflop/s, based on a flop count of $2560 N_c^2 + 2376 N_c$ per site (i.e. 30168 for QCD, see [2] for details). As a default setup we shall use a $32^3 \times 64$ lattice, with $N_c = 3$ and $N_v = 4 N_c$. This geometry is chosen such that a sp-field W, a sp-field F, and the dp-fields old,new fit into the high-bandwidth MCDRAM of 16 GB. For any number of threads, memory allocation of these objects to the individual cores is done by a first-touch policy. The static thread scheduling makes sure every thread gets exactly the same fraction of the out-vector to work on. Compilation is done with ifort version 17.2, with the flags -qopenmp -O2 -xmic-avx512 -align array64byte.
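The quoted flop count and memory footprints (and the claim that everything fits into the 16 GB MCDRAM) can be reproduced from the array shapes; a sketch assuming 8 bytes per sp complex, 16 bytes per dp complex, and 6 $(\mu<\nu)$ planes for the clover field F:

```python
Nc, Nv = 3, 12
sites = 32**3 * 64       # lattice volume
MB = 2**20

print(2560 * Nc**2 + 2376 * Nc)         # 30168 flops per site for QCD

W_MB = Nc * Nc * 40 * sites * 8 / MB    # sp off-axis links W(Nc,Nc,40,...)
F_MB = Nc * Nc * 6 * sites * 8 / MB     # sp clover field, 6 planes assumed
vec_MB = Nc * 4 * Nv * sites * 16 / MB  # one dp vector (old or new)
print(W_MB, F_MB, vec_MB)               # 5760.0 864.0 4608.0

# W + F + old + new against the 16 GB MCDRAM -> True (with little room to spare)
print(W_MB + F_MB + 2 * vec_MB < 16 * 1024)
```

The sum comes to 15840 MB, which explains why this particular lattice geometry was chosen.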
The scaling in the number of threads is shown in Fig. 5. We see an almost perfectly linear behavior up to 64 threads; this means that there is essentially no scheduling overhead. Beyond 64 threads we see load balancing issues among the various threads, except for multiples of 64, where each core is kept busy with exactly the same number of threads.  Some more details are presented in Tab. 1. We see some mild improvement when going from 2 to 4 threads per core; beyond 256 threads the performance plateaus. When comparing sp to dp figures, one should keep in mind that the objects W and F are always in single precision (occupying 5760 MB and 864 MB, respectively), only the vectors old and new change from sp to dp.
Perhaps the most surprising observation is that the same routine (compiled with -xcore-avx2 instead of -xmic-avx512) performs well on a standard Core i7 architecture (Broadwell with 6 cores).
Here the plateauing effect sets in after each physical core is occupied with 2 threads.
Some more experiments on the KNL architecture in sp arithmetic are reported in Tab. 2. The first panel demonstrates that the ~360 Gflop/s are more or less independent of the volume, i.e. we do not see any peculiar cache-size effects. The second panel shows that using more than 12 right-hand-sides (for $N_c = 3$) improves the performance, but beyond 24 right-hand-sides benefits become marginal. Finally, increasing the number of colors beyond 3 is found to be particularly beneficial. My personal guess is that with $N_c = 4$ (or a multiple thereof) the color-matrix-times-spinor multiplication in the SIMD loop in Fig. 4 becomes particularly efficient due to a better filling of the SIMD pipeline.
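A plausible reading of the $N_c = 4$ observation (my interpretation, consistent with the guess above): a 512-bit AVX-512 register holds exactly 8 sp or 4 dp complex numbers, so color loops whose length divides the lane count tile the register without remainder, while $N_c = 3$ always leaves idle lanes:

```python
reg_bits = 512
for label, complex_bytes in (("sp", 8), ("dp", 16)):
    lanes = reg_bits // 8 // complex_bytes   # complex numbers per 512-bit register
    for Nc in (3, 4):
        # remainder 0 means the color loop fills the vector lanes exactly
        print(label, "lanes:", lanes, "Nc:", Nc, "leftover lanes:", lanes % Nc)
```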

Wilson operator timings
Timings are done on a node containing a single KNL chip with 64 cores. All results for the Wilson operator are converted into Gflop/s, based on a flop count of $128 N_c^2 + 72 N_c$ per site (i.e. 1368 for QCD, using the "shrink-expand trick", see e.g. [2] for details). The default setup is again a $32^3 \times 64$ lattice, with $N_c = 3$ and $N_v = 4 N_c$. This time a sp-field V, a sp-field F, and the dp-fields old,new fit well into the high-bandwidth MCDRAM. For any given number of threads, all arrays are allocated afresh, and a first-touch policy is used as in the Brillouin case.
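The Wilson-case numbers follow the same arithmetic as in the Brillouin case; a quick check (assuming 4 forward links per site for the gauge field V and 6 planes for F):

```python
Nc = 3
print(128 * Nc**2 + 72 * Nc)           # 1368 flops per site for QCD

sites = 32**3 * 64
MB = 2**20
V_MB = Nc * Nc * 4 * sites * 8 / MB    # sp gauge field, 4 directions assumed
F_MB = Nc * Nc * 6 * sites * 8 / MB    # sp clover field, 6 planes assumed
print(V_MB, F_MB)                      # 576.0 864.0
```

These match the 576 MB and 864 MB footprints quoted below, and make clear why the Wilson fields fit into the MCDRAM with much more headroom than the Brillouin field W.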
The scaling in the number of threads is shown in Fig. 6. Again, we see a linear behavior up to 64 threads; beyond that, local maxima are seen for multiples of 64 threads, where each core is kept busy with exactly the same number of threads.
Some more details are presented in Tab. 3. On the KNL some mild improvement is seen when going from 2 to 4 threads per core; beyond 256 threads the performance plateaus. When comparing sp to dp figures, one should keep in mind that the objects V and F are always in single precision (occupying 576 MB and 864 MB, respectively), only the vectors old and new change from sp to dp. From the second panel we learn that the Wilson routine, too, performs (without any change) quite well on the standard Core i7 architecture (Broadwell with 6 cores). The main difference to the KNL case is that optimum performance is reached with 1 to 2 threads per core rather than 4.

Table 4. Conversion of the performance measurements into sustained percentage figures, based on a peak performance of 5.2/2.6 [sp/dp] Tflop/s on the KNL and 690/345 Gflop/s on the Core i7 (Broadwell) architecture.
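The conversion underlying Tab. 4 is simply the measured rate divided by the quoted peak; e.g. for the Brillouin figures from the abstract on the KNL:

```python
peak = {("knl", "sp"): 5200, ("knl", "dp"): 2600,   # Gflop/s, peaks quoted above
        ("bdw", "sp"): 690, ("bdw", "dp"): 345}

measured = {("knl", "sp"): 360, ("knl", "dp"): 270}  # Brillouin figures from the abstract

for key, rate in measured.items():
    print(key, round(100 * rate / peak[key], 1), "% of peak")
```

Note that the dp figure sustains a larger fraction of its (halved) peak than the sp figure, as expected for a bandwidth-limited kernel.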