HPC Systems in the Next Decade – What to Expect, When, Where

Abstract. HPC systems have seen impressive growth in terms of performance over a period of many years. The next milestone is expected to be reached soon with the deployment of exascale systems in 2021. In this paper, we provide an overview of the exascale challenges from a computer-architecture perspective and explore technological and other constraints. The analysis of upcoming architectural options and emerging technologies allows us to set expectations for application developers, who will have to cope with heterogeneous architectures, increasingly diverse compute technologies, as well as deeper memory and storage hierarchies. Finally, we discuss needs resulting from changing science and engineering workflows, which have to be addressed by making HPC systems available as part of more open e-infrastructures that also provide other compute and storage services.


Introduction
Over several decades, high-performance computing (HPC) has seen an exponential growth of performance when solving large-scale numerical problems involving dense-matrix operations. This development is well documented by the Top500 list [1], which has been published twice a year since 1993. In various regions around the world, significant efforts are underway to reach the next performance milestone with the deployment of exascale systems. For various reasons, improving the performance of HPC systems has become increasingly difficult. In this paper, we discuss some of these reasons.
The design of future exascale systems depends strongly on long-term trends of key technologies. Analysing these technology trends therefore gives some indication of how HPC will develop in the future. Here it needs to be taken into account that most of the performance-relevant technologies, e.g. CPU technologies, are, with few exceptions, not driven by the HPC market itself. Given the high costs of developing such technologies, market trends have a critical impact on their evolution. A current example is the huge growth of the machine learning and artificial intelligence market, which has started to have a clear impact on the design of new compute accelerators, including GPUs. This led, e.g., to the introduction of tensor core instructions [2] and of instructions for reduced-precision floating-point number representations such as bfloat16.
This paper is organised as follows: We start with an overview of the exascale challenges and programs in section 2. In section 3 we explore the trends for selected exascale technologies and architectures, and in section 4 we provide a brief overview of exascale programming trends. Before providing our conclusions in section 6, we discuss how HPC infrastructures might evolve in the future in section 5.

Exascale computing
For a large number of science and engineering applications, the throughput of floating-point operations while executing dense matrix-matrix multiplications cannot be considered a relevant performance metric. In this paper, we therefore use the term exascale systems instead of exaflop systems. With this term, we refer to HPC systems that allow addressing new science and engineering challenges by providing 1-2 orders of magnitude more performance compared to systems available in 2019. This definition implies that there is no single relevant performance metric when it comes to HPC systems, whose purpose is to enable new science and engineering.
To justify different exascale programs as well as to prepare applications for leveraging the performance of exascale systems, the scientific and engineering case for exascale computing has been explored in different studies (see, e.g., [3,4]). The areas that are expected to particularly benefit from exascale systems include, among others: fundamental sciences (including particle physics), numerical weather prediction, climate research, earth sciences, life sciences, biological and environmental research, research on energy, and materials sciences. There are a number of more dedicated studies for different science domains. An analysis of the readiness of Lattice Gauge Theory calculations for exploiting exascale systems in the US is one such instance [5]. One of the most careful analyses of future computational needs on exascale systems comes from the area of numerical weather prediction and climate research [6].
Different regions in the world have embarked on multi-year programs for designing exascale technologies and systems as well as for procuring, deploying and operating such systems:
• In the US, the Exascale Computing Project (ECP) was initiated with the goal of delivering first exascale systems in 2021. Within this program, three exascale systems are planned to be delivered: Intel has been awarded a contract for a system at Argonne National Laboratory, and HPE (formerly Cray) received contracts for systems at Oak Ridge National Laboratory and Lawrence Livermore National Laboratory.
• In Japan, the development of the Fugaku system was announced in 2014; the system is currently being installed at RIKEN. Unlike the US systems, which leverage GPUs to enhance computational performance, this system is based on a new CPU architecture, which was developed by Fujitsu in collaboration with RIKEN specifically for this system.
• To allow for the realisation of European exascale systems, the European Commission, together with 32 European states, founded a new organisation called EuroHPC [7]. EuroHPC is currently procuring 3 pre-exascale systems and is preparing for the subsequent realisation of at least 2 exascale systems.
• Details of the Chinese roadmap are currently unclear. Plans had been published for the realisation of 3 precursors of future exascale systems, which explore different directions in the architectural design space [8].
All these exascale programs have in common that they have to address the key challenges for reaching exascale, which were already established at the end of the 2000s [9]. Still the most critical is the Energy and Power Challenge, as relatively few sites can afford to operate systems that consume around 30 MW of electricity. The Memory and Storage Challenge also needs very high attention, as the inability to read (write) data from (to) memory or storage fast enough is for many applications the most critical performance bottleneck. The Concurrency Challenge refers to the fact that clock rates stopped increasing as a consequence of reaching the end of Dennard scaling [10]. The throughput of arithmetic operations can therefore only be increased by exploiting more parallelism at multiple levels. Finally, there is a Resiliency Challenge, as the systems are becoming more complex and individual components are becoming more sensitive to their operating environment, which results in a decreasing mean time between failures.

Exascale hardware architectures and technologies
In this section, we explore selected technologies and architectures that are relevant for exascale systems.

Computing devices
As of November 2019, 451 out of the 500 systems listed on the Top500 list [1] were based on Intel Xeon CPUs. These processors provided a continuous ramp-up of floating-point throughput as Intel increased both the number of cores and the width of the SIMD instruction operands. SIMD instruction set architectures like AVX512 take, e.g., 3 input operands comprising 8 double-precision numbers each to perform up to 16 double-precision floating-point operations per fused multiply-add instruction. The other main provider of CPUs based on the x86 instruction set architecture, AMD, recently changed its processor portfolio with the new EPYC architecture. It can be expected that the number of systems based on this CPU will increase significantly in the near future.
There are arguments supporting the expectation that the diversity of CPU architectures used for HPC systems will increase in terms of instruction set architectures. The two systems leading the Top500 list as of November 2019 are based on IBM's POWER9 processor. With 9 out of 500 listed systems being based on POWER9, the market share is relatively small and it is unclear, how this will evolve in the future. Recently, Arm-based CPU architectures started to become a real option for HPC systems. First systems have been realised based on Marvell's ThunderX2 processor and Fujitsu's A64FX processor [11]. In Europe, the European Processor Initiative (EPI) is working on a processor suitable for HPC that is based on Arm technology [12].
There are several reasons for Arm becoming an interesting technology for CPU architectures suitable for HPC. One of them is the introduction of an ISA extension called the Scalable Vector Extension (SVE) [13]. SVE differs from previously introduced SIMD instruction set architectures in that it does not mandate a particular vector length but rather allows designers to choose any multiple of 128 bits between 128 and 2048 bits. The software detects at runtime the vector length of the hardware on which it is executed. SVE being vector-length agnostic is interesting from an architectural perspective as it allows for flexible trade-offs between different levels of parallelism. Processor designers may, within a given hardware budget, either opt for fatter cores with longer vectors or for a larger number of thinner cores with shorter vectors. The A64FX processor comprises 48 cores and uses vectors with a length of 512 bits, while EPI plans for a larger number of cores with shorter vectors of 256 bits.
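The control flow of a vector-length-agnostic loop can be illustrated with a simple strip-mining sketch. The following Python code mimics the pattern; the parameter vl stands in for the vector length that real SVE code would query from the hardware (e.g. via the svcntd instruction), so this is an illustration of the programming style, not actual SVE code:

```python
# Sketch of a vector-length-agnostic (strip-mined) loop, modelled after
# SVE. In real SVE code the vector length is obtained from the hardware
# at runtime; here vl is a stand-in parameter for illustration.
def daxpy_vla(a, x, y, vl):
    """Compute y <- a*x + y in chunks of the runtime vector length vl."""
    n = len(x)
    i = 0
    while i < n:
        # The final chunk may be shorter than vl; SVE handles this with
        # a predicate mask instead of a separate scalar tail loop.
        chunk = min(vl, n - i)
        for lane in range(chunk):
            y[i + lane] += a * x[i + lane]
        i += vl
    return y

# The same code runs unchanged for any vector length:
r4 = daxpy_vla(3.0, [1.0] * 10, [2.0] * 10, 4)
r8 = daxpy_vla(3.0, [1.0] * 10, [2.0] * 10, 8)
print(r4 == r8)  # True: the result is independent of the vector length
```

The point of the pattern is exactly this last check: one binary produces identical results on hardware with 256-bit, 512-bit, or any other permitted vector length.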
As is the case for CPUs, the market for compute accelerators is becoming more diverse. As of today, almost all accelerated HPC systems use GPUs from NVIDIA. However, the exascale systems announced for the US will use GPUs from AMD and Intel.
One of the main drivers for using these compute accelerators is power efficiency. A popular metric to quantify the power efficiency of HPC systems is the ratio of the throughput of floating-point operations and the consumed electrical power. The systems leading in this metric are mostly GPU-accelerated. A remarkable exception is a prototype system based on the A64FX processor, which is currently at position #1. This demonstrates that high power efficiency can also be reached using more conventional CPU architectures.

Memory and storage technologies and architectures
For most applications on today's HPC architectures, the speed at which data in memory or storage can be accessed is the likely cause of performance bottlenecks. The evolution of memory technologies will thus have a key impact on the efficiency at which future HPC systems can be exploited. The memory technologies that are currently used for HPC systems can be classified as follows:
• High-bandwidth technologies like HBM2: HBM is a high-bandwidth interface to stacked SDRAM memory.
• SDRAM DIMMs based on DDR3 or DDR4 continue to be the most widely used memory technology.
• Non-volatile memory technologies like NAND Flash or 3D XPoint are increasingly used in storage architectures as well as for storage integrated into compute nodes.
It is instructive to compare hardware configurations based on these different technologies in terms of capacity C_mem and bandwidth B_mem. In practice, one has to optimise either for capacity or for bandwidth, as can be seen from the following ratio:

∆τ = C_mem / B_mem.

This ratio gives an estimate of the time needed to read or write the full memory. In Table 1 we show values of ∆τ for different hardware configurations, with values differing by orders of magnitude.
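The trade-off can be made concrete with a quick calculation. The capacity and bandwidth numbers below are rough, representative assumptions chosen for illustration, not exact product specifications:

```python
# Illustrative values of dt = C_mem / B_mem for representative memory
# configurations. The numbers are rough assumptions for illustration,
# not vendor specifications.
configs = {
    "HBM2 stack":       {"capacity_byte": 16e9,  "bandwidth_byte_s": 900e9},
    "DDR4 DIMMs":       {"capacity_byte": 256e9, "bandwidth_byte_s": 200e9},
    "NVM (e.g. Flash)": {"capacity_byte": 2e12,  "bandwidth_byte_s": 10e9},
}

for name, c in configs.items():
    delta_tau = c["capacity_byte"] / c["bandwidth_byte_s"]  # seconds
    print(f"{name}: dt = {delta_tau:.3g} s")
# The values span four orders of magnitude (~0.02 s for HBM2, ~1.3 s for
# DDR4, ~200 s for NVM): high bandwidth comes at the cost of capacity.
```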
To provide both large memory capacity as well as high memory bandwidth, different memory technologies need to be used, resulting in deeper memory hierarchies. An abstract view on the resulting architectures is shown in Fig. 1 together with the relevant design parameters, namely the bandwidth and capacity of the high-bandwidth and large-capacity memory tier as well as the throughput of floating-point operations. A first realisation of such an architecture is Intel's Knights Landing (KNL) processor [14].
The emergence of high-bandwidth memory technologies is an important opportunity for realising systems where both the throughput of floating-point operations B_fp and the memory bandwidth B_mem are significantly increased. For many HPC applications, the number of floating-point operations per byte loaded from or stored to memory is O(1). As a consequence, the performance of these applications is limited by memory bandwidth if B_fp/B_mem > O(1 Flop/Byte), which is the case for all current HPC architectures. If one assumes the design target B_fp/B_mem < 10 Flop/Byte, then the use of high-bandwidth memory technologies like HBM is mandatory for GPUs. The first processor using high-bandwidth memory technologies was the KNL processor, which has since been discontinued. But new processor architectures supporting high-bandwidth memory are emerging, most notably Fujitsu's A64FX processor.
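The balance argument can be checked with hypothetical device numbers (the figures below are assumptions for the sake of illustration, not measurements of any particular product):

```python
# Machine balance B_fp / B_mem in Flop/Byte for two hypothetical
# accelerator configurations; the numbers are illustrative assumptions.
def balance(bfp_flop_s, bmem_byte_s):
    """Ratio of floating-point throughput to memory bandwidth."""
    return bfp_flop_s / bmem_byte_s

# A GPU with HBM: ~7 TFlop/s fed by ~0.9 TByte/s
gpu_hbm = balance(7e12, 0.9e12)   # ~7.8 Flop/Byte, within the < 10 target
# The same compute fed by DDR4 only: ~0.2 TByte/s
gpu_ddr = balance(7e12, 0.2e12)   # 35 Flop/Byte, far beyond the target

print(gpu_hbm < 10, gpu_ddr < 10)  # True False
```

This is why, under the stated design target, the high compute throughput of GPUs can only be balanced with HBM-class memory bandwidth.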
A similar challenge arises in the context of storage architectures. Current and future storage devices feature vastly different values of ∆τ. For today's hard drives with a capacity of 10 TByte we have ∆τ ≈ 15-20 hours. As a consequence, hierarchical storage architectures are starting to be developed, implemented and used for HPC systems. One example is DDN's Infinite Memory Engine (IME). IME is designed as an intermediate storage layer between an HPC system and an external parallel file system and uses high-performance SSDs. IME can, e.g., be used as a burst buffer [15], which allows HPC jobs to write bursts of data at high speed. This data is later asynchronously migrated to the external file system. (For an early performance evaluation of IME see [16].) The approach used for IME is further expanded in the SAGE architecture [17], which is based on the new native object storage platform Mero. Multiple storage tiers with different performance and capacity characteristics can be seamlessly integrated, as data objects can be distributed over multiple tiers.
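The quoted 15-20 hours follow directly from typical hard-drive parameters; the assumed sustained sequential transfer rate of roughly 140-190 MByte/s is an illustrative assumption:

```python
# dt for a 10 TByte hard drive, assuming sustained sequential transfer
# rates of 140-190 MByte/s (an illustrative assumption).
capacity = 10e12  # Byte
for rate in (140e6, 190e6):  # Byte/s
    hours = capacity / rate / 3600
    print(f"{rate / 1e6:.0f} MByte/s -> {hours:.1f} h")
# 140 MByte/s -> 19.8 h; 190 MByte/s -> 14.6 h, i.e. the 15-20 hour
# range quoted above.
```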

Exascale architecture swim lanes
Based on the currently announced roadmaps as well as on the earlier analysis of the technology trends, we can identify the following swim lanes for exascale system designs:
• Thin node design: Nodes comprising 1 or possibly 2 CPUs
• Fat node design: Nodes comprising 1 or 2 CPUs and a set of tightly interconnected GPUs
The thin node design is adopted for the upcoming Japanese exascale system Fugaku. It uses small and simple nodes with a single A64FX processor. This compact design is facilitated by a processor design with an integrated network interface and HBM2 memory stacks integrated into the processor package. The processor features 48 cores, each with 2 512-bit SVE units, running at a clock frequency f = 2 GHz, which results in a peak performance B_fp = 3 TFlop/s. The use of HBM2 memory allows for a memory bandwidth B_mem = 1 TByte/s. The drawback of this design choice is a relatively small memory capacity per node.
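The quoted peak performance follows directly from the parameters given above, counting a fused multiply-add as 2 floating-point operations:

```python
# Peak performance of the A64FX derived from its published parameters.
cores = 48
sve_units_per_core = 2
vector_bits = 512
double_bits = 64
flops_per_fma = 2   # a fused multiply-add counts as 2 flops
clock_hz = 2e9

lanes = vector_bits // double_bits               # 8 DP lanes per SVE unit
bfp = cores * sve_units_per_core * lanes * flops_per_fma * clock_hz
print(f"B_fp = {bfp / 1e12:.2f} TFlop/s")        # B_fp = 3.07 TFlop/s

# With B_mem = 1 TByte/s this corresponds to a machine balance of about
# 3 Flop/Byte, comfortably within the B_fp/B_mem < 10 design target.
print(f"balance = {bfp / 1e12:.2f} Flop/Byte")
```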
The fat node design is used for all 3 US exascale systems: Aurora, Frontier and El Capitan. For these systems, a rather limited amount of information has been published so far. They are likely based on nodes comprising 4 GPUs, which allows estimating a lower limit for the expected node performance. Table 2 shows a comparison of the node architectures for both exascale swim lanes. While the Fugaku system uses a network based on a 6-dimensional torus topology, a general trend towards network topologies with reduced bisection bandwidth can be observed. This choice is mainly driven by costs, as these topologies allow reducing the number of switches and expensive optical cables. Cray introduced a dragonfly topology already for the XC series of systems with the Aries interconnect [18]. The same type of topology will be used for the upcoming Shasta series systems with the Slingshot interconnect, on which all 3 US exascale systems are based [19]. Mellanox developed a variant of this topology called Dragonfly+ [20].

Exascale programming
With hardware becoming more parallel and complex, programming models are becoming more important. These programming models provide abstractions that hide details of the underlying hardware architecture. As a result, software can be developed in a portable and, ideally, even performance-portable manner. In practice, given the large and continuously evolving diversity of hardware architectures, the challenge of designing a suitably general abstraction layer is huge. A large number of programming models have been and continue to be developed.
Some of the models like OpenMP [21], OpenACC [22] or OmpSs [23] are based on directives, which are inserted into a serial code to guide the compiler on how to parallelise the code within a node or to offload parts of it to accelerators like GPUs. Since this approach does not always allow exploiting such accelerators in the most efficient way, native programming languages for specific accelerators continue to be developed. Examples are CUDA [24] for NVIDIA GPUs or ROCm [25] for AMD GPUs. For its upcoming GPUs, Intel advertises the use of SYCL [26], which is positioned as a more general language for heterogeneous node architectures. Task-based programming models that require explicit programming of tasks, e.g. StarPU [27], allow tasks to be implemented in native languages like CUDA. The role of the programming model and its associated runtime system is to manage efficient scheduling of the tasks at runtime, based on the discovered hardware capabilities. This has been taken a step further in newer programming models like Kokkos [28] or RAJA [29], which are designed with the goal of allowing code for complex node architectures with deeper memory hierarchies and different types of compute devices to be developed in a portable manner.
The dominating programming model for distributed clusters comprising more than one node is MPI [30]. MPI defines a basic set of routines that cover, in particular, the communication needs of the vast majority of HPC applications. All suppliers of HPC solutions provide at least one efficient implementation of MPI and, therefore, a high level of portability is reached. Alternative models for programming distributed clusters typically follow the partitioned global address space (PGAS) paradigm. It is based on the assumption that the global memory address space can be logically partitioned, with a portion of the memory being assigned to a specific process.
This global address space can be used for designing parallel programming languages like Unified Parallel C (UPC) [31], which support parallel data objects like distributed arrays. The creation of a global address space can also be used to design communication libraries that implement one-sided communication operations to get (put) data from (to) a remote location. Examples of such PGAS programming models that are becoming more popular are OpenSHMEM [32] and GPI [33]. The PGAS approach has also been used for designing parallel runtime systems that allow leveraging system-level as well as node-level parallelism. One such example is HPX [34], which recently demonstrated good scalability for astrophysics simulations on almost 660,000 cores [35].
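The one-sided semantics of the PGAS paradigm can be sketched with a toy model: the address space is partitioned into per-process segments, and any process can put data into, or get data from, another process's segment without the target being actively involved. The class below is purely illustrative (a dict-free, list-based mock, not any real PGAS library or API):

```python
# Toy model of PGAS-style one-sided communication. Each "process" owns a
# partition of a logically global address space; put/get access remote
# partitions directly. Illustrative only - not a real PGAS library.
class ToyPGAS:
    def __init__(self, nprocs, local_size):
        # The global address space, partitioned into per-process segments.
        self.segments = [[0.0] * local_size for _ in range(nprocs)]

    def put(self, target_rank, offset, values):
        # One-sided put: write into a remote process's partition.
        self.segments[target_rank][offset:offset + len(values)] = values

    def get(self, source_rank, offset, count):
        # One-sided get: read from a remote process's partition.
        return self.segments[source_rank][offset:offset + count]

space = ToyPGAS(nprocs=4, local_size=8)
space.put(2, 0, [1.0, 2.0, 3.0])       # write into rank 2's partition
print(space.get(2, 0, 3))              # [1.0, 2.0, 3.0]
```

Real libraries such as OpenSHMEM or GPI provide this style of remote put/get over an actual network, where the decoupling of communication from the target process enables overlap of communication and computation.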
In the foreseeable future, a broad variety of programming models will continue to exist. No single currently available programming model allows for efficient exploitation of all current and upcoming architectures. Furthermore, current programming models are still very much compute oriented and lack capabilities for data- and memory-aware orchestration of data. This observation means that significant efforts for keeping applications portable and, in particular, performance portable remain with the application developers. To guide these efforts it is helpful to follow a separation-of-concerns strategy aiming for applications designed as follows: on the one hand, computational tasks are specified in a way that can more easily be handled by domain scientists, while on the other hand, performance is achieved through architecture-specific back-ends designed by HPC experts [37].

Future HPC infrastructures
Many of the workflows using HPC systems used to consist of simulations that consume and produce a moderate amount of data, which is initially kept on-site and later optionally transferred to other locations for post-processing and long-term archiving. HPC centres are optimised for such workflows by providing optimised but often one-of-a-kind hardware configurations and software deployments. The available resources are consumed through batch processing, which allows for maximising the utilisation of the available resources.
The needs of HPC users are, however, increasingly changing, with emerging workflows extending beyond the HPC centre. There are various examples of such workflows related to physics experiments, ranging from the utilisation of opportunistic computing cycles on HPC facilities for processing data generated by high-energy physics experiments (see, e.g., [38]) to real-time processing of synchrotron data (see, e.g., [39]). These examples have in common that data coming from outside of the data centre needs to be injected into the HPC system. They also raise the need to steer (to different extents) HPC processing from outside the HPC centre. Finally, these workflows are executed as a joint effort involving scientists from different sites, who need to collaboratively use different types of resources, including HPC resources.
Observing these trends contributed to the notion of a "digital continuum" that includes, among others, HPC systems [40]. It assumes that, with the goal of creating insights as quickly as possible, scientists and engineers might in the future use data originating from Internet-of-Things (IoT) devices, pre-process this data using edge devices, and finally use HPC and cloud services for data analysis and simulations.
Even if this is a more long-term vision, HPC infrastructures will have to become more open in the near future. This will be challenging not only for legacy reasons but also because HPC systems have to continue to be operated in a protected environment for security reasons. The following developments can be expected:
• Federated infrastructure for authentication, authorisation and accounting: While many of the technical challenges have been solved in grid and cloud infrastructures, an extension to infrastructures involving HPC systems remains challenging, in particular, due to organisational and policy challenges.
• Expanded service portfolio including Cloud-type services: To facilitate HPC systems becoming part of larger workflows as well as to support collaborative research models, HPC centres will have to deploy Cloud-type services that allow, e.g., users to start up virtual machines for deploying services, or that provide Cloud storage systems with federated access.
• Improved interactivity: As of today, very limited resources are provided for interactive processing close to HPC systems and the associated large-scale data repositories. Interactive computing services, including support for interactive frameworks like Jupyter notebooks, are also needed to make collaborative working easier.
• Service composability: Future workflows will not only comprise computing on HPC systems but will also involve the use of other services like virtual machines or different types of storage services.
• New resource allocation models: HPC resources are today mainly allocated based on scientific excellence determined beforehand. The associated peer-review processes are typically lengthy, and they prevent the flexible resource allocation needed for emerging workflows and new resource offerings.

Conclusions and outlook
With first exascale systems becoming available in 2021, several swim lanes for exascale architectures were identified in this report. Most of the announced systems will be based on fat nodes, which comprise several compute accelerators. However, a thin node approach can also lead to promising architectures, in particular if the focus is on increasing memory bandwidth rather than the throughput of floating-point operations. A critical enabling technology is high-bandwidth memory, which is necessary to realise systems with a memory bandwidth B_mem ∼ 0.1 EByte/s. The use of future exascale and other HPC systems will be challenging for users, as they will have to cope with heterogeneity and diversity. CPUs with different instruction set architectures will be used, mainly x86 and Arm. Additionally, the number of suppliers of GPUs suitable for HPC will increase, posing a significant challenge to making applications portable and, in particular, performance portable. Finally, to meet both memory performance and capacity demands, memory hierarchies will become deeper, resulting in the need for support of data orchestration.
Programming models can help to cope with some of these complexities, but the complexities cannot be fully hidden from application developers. There is, therefore, an increased need for sustainable code-modernisation efforts. An important strategy here is to realise a separation of concerns between domain scientists and HPC experts.
With science and engineering workflows changing, HPC systems cannot be designed as silos in the future but should rather become part of e-infrastructures that extend beyond HPC centres. This will require new approaches for managing the boundaries to HPC environments.