Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

We consider an interesting from computational point of view standing wave simulation by solving coupled 2D perturbed Sine-Gordon equations. We make an OpenMP realization which explores both thread and SIMD levels of parallelism. We test the OpenMP program on two different energy equivalent Intel architectures: 2× Xeon E5-2695 v2 processors, (code-named “Ivy Bridge-EP”) in the Hybrilit cluster, and Xeon Phi 7250 processor (code-named “Knights Landing” (KNL). The results show 2 times better performance on KNL processor.


Introduction
The second generation Intel XeonPhi T M processors code-named Knights Landing (KNL) are expected to deliver better performance than general purpose CPUs like Intel Xeon processors for applications with both high degree of parallelism and well behaved communications with memory [1].Compute-bound applications run better on KNL due to its larger (512-bit) vector registers.Bandwidth-bound applications also run better on KNL due to its high bandwidth memory (HBM).
In this work we consider an interesting from computational point of view example of numerical solution of coupled 2D perturbed Sine-Gordon equations.We really need serious computational resources because in some cases the computational domain size may be very large -10 6 -10 8 mesh points and very long time integration is also needed -10 8 -10 9 time steps.Usually applications with stencil operations (like those in the presented work) are bandwidth-bound.The calculation of the transcendental sine function however makes our application to be closer to the compute-bound case and hence it benefits both from using HBM and from vectorization.
The considered systems of coupled 2D perturbed Sine-Gordon equations are of practical interest because it is well known that they model the dynamics of the so called Intrinsic Josephson Junctions (IJJ) [2].
The goals of the work are: • To make an OpenMP realization of a finite difference scheme for solving systems of 2D perturbed Sine-Gordon equations.We want this realization to explore both thread and SIMD levels of parallelism.

Mathematical model and numerical scheme
We consider the following systems of 2D perturbed Sine-Gordon equations : Here ∆ is the 2D Laplace operator.Ω is a given domain in R 2 .S is the N eq × N eq cyclic tridiagonal matrix: , where −0.5 < s ≤ 0 and N eq is the number of equations.The unknown is the column vector ϕ(x, y, t) = (ϕ 1 , . . ., ϕ N eq ) T .Neumann boundary conditions are considered: In ( 2) n denotes the exterior normal to the boundary ∂Ω.To close the problem (1)-( 2) appropriate initial conditions are posed.The model ( 1)-( 2) describes very well the dynamics of N eq periodically stacked IJJs [2].The parameter s represents the inductive coupling between adjacent Josephson junctions, α is the dissipation parameter, γ is the external current.All the units are normalized as in [2].
We follow the approach of construction of the numerical scheme from [3].We solve the problem numerically in rectangular domains by using second order central finite differences with respect to all of the derivatives.As a result at every mesh point of the domain we have to solve a system with the tridiagonal matrix S .Because of the specific tridiagonal structure of S we need only (9N eq − 12) floating point operations for solving one system.So the algorithm complexity is O(N eq • N x • N y • N time_steps ), where N eq is the number of equations, N x is the number of mesh points in x-direction, N y is the number of mesh points in y-direction and N time_steps is the number of steps in time.

Parallelization strategy and performance scalability results
OpenMP Fortran realization of the above numerical scheme is made.We store the unknown solution of two consecutive time levels in two multidimensional arrays U1(N eq , N x , N y ), U2(N eq , N x , N y ).To ensure a good data locality, the main loop over indexes of U1 and U2 follows the column major order of multidimensional arrays in Fortran language.
To utilise the computational capabilities of the considered processors we explore two levels of parallelism: SIMD parallelism (vectorization) at the inner level and thread parallelism at the outer level.The smallest piece of work which we consider consists of solving 8 successive linear systems with the cyclic tridiagonal matrix S .Such pieces of work are distributed between OpenMP threads.Both the calculation of the right-hand sides for each linear system and solving 8 linear systems at once The use of -O2 optimization flag for compiling ensures an automatic vectorization of the innermost loops.Therefore the indexes of the outermost loop are distributed between the OpenMP threads and all the innermost loops (all with length 8) are vectorized.Let us mention that 8 double precision words correspond to the length of one vector register in KNL processors -512 bit and two vector registers (256 bit each) in Ivy Bridge-EP processors.As a result we achieve about 2 times better performance from vectorization on "Ivy Bridge-EP" and about 4 times better performance from vectorization on KNL processors.The achievement of 50 % effectiveness from vectorization can be explained by the fact that our application is somewhere between bandwith-bound and compute-bound cases.On the one hand the application is of stencil type which type is by rule bandwidth-bound.On the other hand a calculation of the transcendental sine function is needed at every mesh point of the computational domain, which makes our application closer to the compute-bound case.
In the following figure the computational domain size is: N eq × N x × N y = 8 × 4096 × 4096.As seen from this figure we achieve good performance scalability on both architectures and 2 times better performance on KNL processor.

Numerical example of a nonlinear standing wave
To check that the realized OpenMP program really works, we repeated the numerical results (in 2D case) from the classical works [5,6].As explained in these papers the powerful THz radiation from IJJs reported in [7] corresponds to a new type of standing wave solutions with excited so called cavity modes.For certain parameters α and γ the phase ϕ(x, y, t) in a particular equation (junction) is a sum of three terms: a linear with respect to the time term, vt, a static π kink term pi_kink(x, y) and an oscillating term: ϕ(x, y, t) = vt + pi_kink(x, y) + oscillating_term(x, y, t) The oscillating term is approximately a solution of the linear wave equation: with Neumann boundary conditions.As opposite to the oscillating term, which is the same for each junction (equation), the static term (in our case static π kinks) have alternative character, i.e. opposite static π kinks alternate in odd and even junctions (equations).

Conclusions
An OpenMP program for solving systems of 2D perturbed Sine-Gordon equations is realized and tested on two different Intel architectures: one computational node in the Hybrilit cluster consisting of two Ivy Bridge-EP processors and a KNL processor provided by RSC Group, Moscow.The results shows 2 times better performance on the KNL processor.

Figure 1 .
Figure 1.Performance scalability as speedup relative to one vectorized thread on 2 × "Ivy Bridge-EP" processors