Application of SLURM, BOINC, and GlusterFS as Software System for Sustainable Modeling and Data Analytics

Abstract. Modern numerical modeling experiments and data analytics problems in various fields of science and technology impose a wide variety of demanding requirements on distributed computing systems. Many scientific computing projects at times exceed the limits of the available resource pool and therefore require extra scalability and sustainability. In this paper we share our experience and findings in combining SLURM, BOINC and GlusterFS into a software system for scientific computing. In particular, we propose a complete architecture and highlight important aspects of systems integration.


Introduction
Distributed environments have become very influential in contemporary data processing and analytics. Over several decades, we have observed architectural changes emerging to support computing trends of various forms. Among them were hardware-defined platforms, SSI-based systems [1] and clustered multi-node solutions orchestrated by different batch schedulers. Contemporary computing architectures [2] take a clustering approach, managing computing processes among the nodes of a large distributed computing platform. Many papers have been written about the general nature of distributed computing. In this paper, we focus on more "local" and application-specific tasks.
The problem of infrastructure deployment arose during 2015-2016, when we decided to process medium-sized data sets and perform the corresponding computations for several projects led by Dr. I.L. Kaftannikov. At the early stages, we used volunteer computing resources managed by the BOINC system. The more data we wanted to process, the more internal resources we had to involve. As the project grew, we had to redefine the local infrastructure to make it more reliable and stable. We also had some computational jobs that required additional control and could no longer be run in the context of BOINC, because we needed lower latency and a higher degree of security. At the same time, we wanted to preserve the existing approach, which allows volunteers to contribute their computational resources.

Early Stages Architecture. BOINC System
As stated above, we chose the BOINC system as the initial solution. It is a free software platform [3] developed at the University of California, Berkeley, that implements a volunteer, public-resource model of distributed computing. BOINC allowed us to draw on different types of computing power by connecting many individual computers into a common computational network. Owners of physical or virtual computing resources could join us publicly via the BOINC client program and contribute some of their resources. The user-side part consists of a client and a graphical program, BOINC Manager. The manager provides a GUI for controlling one or more BOINC clients. Details of the BOINC project can be found in [3], and improvements in [4]. The most recent changes and documentation can be obtained from the GitHub repository and the official website.
Many popular BOINC projects [3] demonstrate significant computational performance, but with some strict limitations and only for a certain class of jobs. This allows scientific groups to carry out resource-intensive calculations without expensive hardware. With this approach, however, we faced two major architectural problems:
1. The architecture of data storage is undefined, as the BOINC system operates at a higher layer of data abstraction.
2. The scheduling model within the BOINC system is inconsistent with local job distribution, which calls for a fast-acting system with low transition rates.
Based on Refs. [5][6][7], we decided to use SLURM as the local batch scheduler and GlusterFS as the distributed file system for storing experimental data sets.

Redefined Architecture. GlusterFS and SLURM
GlusterFS is a distributed, parallel, linearly scalable, fault-tolerant file system. We used it over TCP/IP networking to export data "bricks" from different servers, forming a single parallel distributed data space. GlusterFS operates in user space, is based on FUSE technology and works on top of existing file systems, so we did not have to deal with the problem of kernel support. Unlike other distributed file systems, such as Lustre and Ceph [5], GlusterFS does not require a separate server for storing metadata, which provides a great degree of flexibility and scalability. By bringing GlusterFS into our BOINC stack we changed the storage mechanisms, which allowed us to store our data more efficiently.
The whole solution runs on CentOS 6.8. BOINC and SLURM form the computational core. A custom meta-scheduler controls the job distribution process: it classifies jobs and sends them to the correct queue according to the specific scientific workflow. GlusterFS provides an extensible, single, parallel distributed data space. OpenVPN forms a secure overlay network for communication between the internal services. We used it to connect several storage nodes that were not directly accessible via an IPv4 address. Administrators and users interact with the system via SSH commands and a GUI.
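To illustrate why no separate metadata server is needed, the following minimal Python sketch locates a file purely by hashing its name, so any client can compute the target brick on its own. It is a simplified illustration, not GlusterFS's actual elastic hashing implementation; the function `pick_brick` and the brick addresses are hypothetical.

```python
import hashlib
from typing import Sequence


def pick_brick(file_name: str, bricks: Sequence[str]) -> str:
    """Map a file name to one brick using only a hash of the name.

    Simplified illustration: every client computes the same result
    from the name alone, so no central metadata lookup is required.
    """
    if not bricks:
        raise ValueError("at least one brick is required")
    digest = hashlib.sha256(file_name.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(bricks)
    return bricks[index]


# Hypothetical brick addresses written in "server:/path" notation.
BRICKS = ["node1:/data/brick1", "node2:/data/brick1", "node3:/data/brick1"]

if __name__ == "__main__":
    for name in ("run_001.dat", "run_002.dat", "results.csv"):
        print(name, "->", pick_brick(name, BRICKS))
```

A plain modulo placement like this would relocate most files when a brick is added; production systems avoid that with range-based layouts, which this sketch deliberately leaves out.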
The other valuable part of our new architecture is SLURM. Its developers describe it as a scheduling system for large and small Linux clusters. It requires no kernel modifications for its operation and is relatively self-contained [7]. However, the integration with SLURM required job classification, which was done according to the following scheme:
1. Jobs with low data exchange intensity that can be paused and reassigned run on the volunteer resources.
2. Jobs with intensive data exchange or strict deadlines run on the local cluster and are scheduled via SLURM.
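The two-rule classification scheme above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the meta-scheduler's actual code: the `Job` fields, the `Queue` names and the fallback for jobs matching neither rule are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Queue(Enum):
    VOLUNTEER = "boinc"  # volunteer resources managed by BOINC
    LOCAL = "slurm"      # local cluster scheduled via SLURM


@dataclass(frozen=True)
class Job:
    # All field names are hypothetical and chosen for illustration.
    name: str
    data_intensive: bool
    has_strict_deadline: bool
    can_pause_and_reassign: bool


def classify(job: Job) -> Queue:
    """Pick a queue for a job following the two rules in the text."""
    # Rule 2: intensive data exchange or strict deadlines -> local cluster.
    if job.data_intensive or job.has_strict_deadline:
        return Queue.LOCAL
    # Rule 1: low data exchange, pausable and reassignable -> volunteers.
    if job.can_pause_and_reassign:
        return Queue.VOLUNTEER
    # Assumption: jobs covered by neither rule stay on the local cluster.
    return Queue.LOCAL


if __name__ == "__main__":
    sweep = Job("parameter_sweep", False, False, True)
    solver = Job("coupled_solver", True, False, False)
    print(sweep.name, "->", classify(sweep).value)
    print(solver.name, "->", classify(solver).value)
```

Checking the local-cluster rule first means a job with a strict deadline is never sent to volunteer resources, even if it could technically be paused.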
After the integration depicted in figure 2, SLURM filled an important gap in our architecture by providing the ability to start, execute, and monitor jobs on the set of locally allocated nodes. Now, when a user registers a specific job in the system's workflow, the meta-scheduler can run it locally or assign it to any available volunteer computing resource, depending on the set of preferences. The distribution scheme [8][9][10], as well as the tuning, monitoring [11] and maintenance of the whole infrastructure, can become a really complex task. This is beyond the scope of this work, but we refer interested readers to the papers and books listed in our references.
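As a hedged illustration of the local path, the Python sketch below assembles an `sbatch` command line for a job that the meta-scheduler has decided to run on the cluster. The helper `build_sbatch_command` and its default values are hypothetical; only the standard `sbatch` options `--job-name`, `--nodes`, `--time` and `--wrap` are used, and the command is printed rather than executed.

```python
import shlex
from typing import List


def build_sbatch_command(job_name: str, command: str,
                         nodes: int = 1,
                         time_limit: str = "01:00:00") -> List[str]:
    """Return an sbatch argument list for a single shell command.

    Hypothetical helper: it only builds the arguments and leaves the
    actual submission to the caller.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return [
        "sbatch",
        f"--job-name={job_name}",
        f"--nodes={nodes}",
        f"--time={time_limit}",
        f"--wrap={command}",  # sbatch wraps the command in a shell script
    ]


if __name__ == "__main__":
    cmd = build_sbatch_command("demo", "python3 model.py",
                               nodes=2, time_limit="00:30:00")
    # Print a shell-quoted form instead of submitting the job.
    print(shlex.join(cmd))
```

On a host with SLURM installed, the returned list could be passed to `subprocess.run`; building the arguments as a list instead of a single string avoids shell quoting problems.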

Figure 1. Architecture of the BOINC project. The platform is centralized and includes server and user-side parts.

Figure 2. Diagram of the new architecture.