A Feasibility Study on Workload Integration between HTCondor and Slurm Clusters



Introduction
There are two local computing clusters at the Institute of High Energy Physics (IHEP): an HTCondor cluster and a Slurm cluster. Most jobs running on the HTCondor cluster are single-core jobs, while parallel and multi-core jobs run on the Slurm cluster.
The resource utilization ratio of the HTCondor cluster has exceeded 90%, which means it has reached the limit of its resource provision for now. However, the workload of the Slurm cluster is relatively light, and its resource utilization ratio is 50% on average. If HTCondor jobs could be migrated to and run on the Slurm cluster, users of the HTCondor cluster would have more resources to run their jobs, and the resource utilization ratio of the Slurm cluster would be increased at the same time.
To demonstrate that workload integration between the HTCondor and Slurm clusters is feasible, section 2 lists and compares related and similar works. Then, three ways to inter-connect the HTCondor cluster and the Slurm cluster are investigated in section 3, while issues and possible solutions related to job scheduling and resource allocation are described in section 4. Section 5, the last part of this proceeding, gives the conclusion of our work.

Related works
Similar research on workload migration has been done on how to take advantage of HPC resources for HEP computing applications. B. O'Brian et al. [1] proposed to run HTC jobs on HPC sites, and listed possible obstacles including job submission, data management and network connections. In the following years, many HEP experiments, including ATLAS, CMS and ALICE, proposed individual solutions to address these obstacles and to make better use of HPC resources. For example, ATLAS uses ARC CE to bridge the grid and HPC clusters, and provides the Event Service to address the issue that jobs may get preempted without any warning [2][3][4]. CMS adopts HTCondor and ROCED to run jobs inside virtual machines on HPC clusters [5,6]. ALICE makes use of ANALISA to simplify access to HPC systems by handling job submission, software management, and local data management [7]. Conventionally, good use cases for HPC clusters are computation-heavy applications, which is why most of the workload migrated from HEP computing to HPC clusters consists of simulation jobs. To make I/O-intensive jobs from HEP experiments run smoothly on HPC, NERSC Cori provides a storage cache layer named Burst Buffer to improve I/O performance [8]. Besides, to make installation of the HEP software stack more convenient, a container technology named Shifter was developed to provide the same system environment as the grid on NERSC Cori [9].
All the research mentioned above aims to solve one or more issues about HTC workload migration to HPC sites. When we try to migrate HTC jobs from the HTCondor cluster to the Slurm cluster at IHEP, similar obstacles such as job submission and system environment are confronted as well. However, because the HTCondor and Slurm clusters are located on the same LAN and mount the same shared file systems, network and data are not issues anymore. As a result, we focus on more specific issues, including better ways to inter-connect the HTCondor and Slurm clusters and other issues related to job scheduling and resource allocation. C. Hollowell et al. [10] tested HTCondor-C to mix HTC and HPC workloads. Compared with the research C. Hollowell et al. did, we tested two more ways, overlap and flocking, to inter-connect the HTCondor cluster and the Slurm cluster. HTCondor-C is preferable for our future work; more details are presented in section 3. Additionally, we also propose three issues and possible solutions in section 4.

Job migration
The first question about workload integration is how to migrate jobs from HTCondor to Slurm. The most intuitive way to run HTCondor jobs on the Slurm cluster is by overlapping part of the worker nodes of the Slurm cluster with the HTCondor cluster. Because there is no official name for this interconnection method, we call it 'overlap' in this paper. Apart from overlap, HTCondor provides two native methods: flocking and HTCondor-C [11]. In simple words, to achieve job migration from HTCondor to Slurm, flocking treats the HTCondor cluster and the Slurm cluster as two separate pools, and jobs submitted to the HTCondor cluster are flocked to the Slurm cluster if they cannot run immediately. Compared with flocking, HTCondor-C is easier to understand. It is designed to run HTCondor jobs on other heterogeneous clusters, for example, a Slurm cluster in our circumstances. With HTCondor-C, jobs from HTCondor are transformed into Slurm jobs and migrated to the job queue of the Slurm cluster. Later, the migrated jobs get opportunities to run if there are available resources. To verify that these three methods work, a testbed was set up: a host slurm04 was configured as an HTCondor cluster, and a host slurm03 was configured as a Slurm cluster.

Overlap
As mentioned earlier, HTCondor jobs are run on the Slurm cluster by overlapping part of the worker nodes of the Slurm cluster. When it comes to installation and configuration, node overlapping is accomplished by running HTCondor daemons on Slurm worker nodes, which means that both Slurm and HTCondor daemons are running on those worker nodes; that is the exact reason why this method is named 'overlap'.
From the view of HTCondor, the overlapped worker nodes are the same as the original worker nodes of the HTCondor cluster; meanwhile, Slurm does not know that some of its worker nodes have been added to the HTCondor cluster. This raises the question: what happens if both Slurm and HTCondor try to allocate the same resources to jobs? To avoid resource allocation conflicts, some code needs to be added to the prolog and epilog scripts of the Slurm cluster. In simple words, the code in the prolog script stops HTCondor jobs and releases resources for the coming Slurm job, while the code in the epilog script starts HTCondor jobs again when the Slurm job finishes. The architecture of overlap is shown in figure 1.
The key configurations of overlap are listed in table 1.
The test results shown in figure 2 demonstrate that HTCondor jobs can run in the Slurm cluster with overlap.
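As a sketch of what those prolog and epilog hooks could look like (the exact commands and options below are our assumptions, not the production scripts), the prolog can switch the node's HTCondor startd off and the epilog can switch it back on:

```shell
#!/bin/bash
# Hypothetical Slurm prolog/epilog helpers for an overlapped worker node.
# Assumption: the HTCondor startd runs on the same node as slurmd.

# Prolog side: evict HTCondor jobs quickly so the incoming Slurm job
# gets its cores; '|| true' keeps Slurm from failing the node if the
# startd is already down.
release_node_for_slurm() {
    condor_off -startd -fast >/dev/null 2>&1 || true
}

# Epilog side: restart the startd so the node rejoins the HTCondor pool
# once the Slurm job has finished.
return_node_to_htcondor() {
    condor_on -startd >/dev/null 2>&1 || true
}
```

In production these helpers would be called from the scripts named by the Prolog and Epilog options in slurm.conf.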
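As an illustration only (the host name is taken from our testbed, the rest is assumed), the HTCondor side of an overlapped Slurm worker node needs little more than a master and a startd pointing at the HTCondor central manager:

```
# condor_config fragment on an overlapped Slurm worker node (sketch)
CONDOR_HOST = slurm04          # central manager of the HTCondor cluster
DAEMON_LIST = MASTER, STARTD   # run only the execute-side daemons here
```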

Flocking
According to the HTCondor User Manual [11], flocking is "HTCondor's way of allowing jobs that cannot immediately run within the pool of machines where the job was submitted to instead run on a different HTCondor pool". The difference between overlap and flocking is that flocking treats the worker nodes from the Slurm cluster as a second pool, while overlap sees the worker nodes from the Slurm cluster as part of the same pool. Figure 3 shows the architecture of flocking. Similar to overlap, flocking also has the resource conflict issue to deal with. In order to address this issue, the same prolog and epilog scripts as the ones for overlap were adopted in the Slurm cluster. Table 2 lists the key configurations of flocking.
Flocking was tested, and the results showed that when there are no available resources on the worker nodes of the HTCondor cluster, jobs are scheduled to the worker nodes of the Slurm cluster, which is seen as the second pool of HTCondor. Figure 4 shows the test results.
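For illustration, a minimal flocking setup (host names from our testbed, the rest assumed) involves both sides: the submit machine lists the second pool in FLOCK_TO, and the second pool's central manager admits it via FLOCK_FROM:

```
# On the HTCondor submit machine (slurm04), sketch:
FLOCK_TO = slurm03

# On the central manager of the second pool (slurm03), sketch:
FLOCK_FROM  = slurm04
ALLOW_WRITE = $(ALLOW_WRITE), slurm04
```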

HTCondor-C
HTCondor-C is the second native way to connect HTCondor with other batch systems, for example, Slurm in our case. After HTCondor jobs are migrated to the job queue of the Slurm cluster, HTCondor can still get the status of the migrated jobs; at the same time, the migrated jobs are treated as Slurm jobs and managed according to Slurm scheduling algorithms. Test results of HTCondor-C are shown in figure 6.
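A hedged sketch of an HTCondor-C submit description in this setup (the schedd and pool names are assumptions based on our testbed): the job enters the grid universe and is forwarded to the schedd on the remote side, which hands it over to Slurm.

```
# HTCondor-C submit description (illustrative only)
universe      = grid
grid_resource = condor slurm03 slurm03   # remote schedd name, remote pool
executable    = job.sh
output        = job.out
error         = job.err
log           = job.log
queue
```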

Comparison of overlap, flocking and HTCondor-C
HTCondor jobs can be migrated to and run in the Slurm cluster with all three methods. However, there are preferences considering configuration, management and scheduling policies. Table 4 shows the comparison of overlap, flocking and HTCondor-C. It can be seen that though overlap and flocking are easier to configure, they provide almost no mechanism to customize scheduling algorithms, which is quite important for the next development step. It would be better if customized scheduling algorithms could take effect, so that only qualified jobs are migrated and run in the Slurm cluster. Besides, if circumstances change, it would be more flexible to adjust with customized scheduling algorithms. For example, when more resources from the Slurm cluster are provided for the migrated jobs, we could bring a heavier workload to the Slurm cluster by adopting a greedy scheduling algorithm.
In contrast, HTCondor-C provides the job router [11] to set parameters related to job scheduling. Besides, a customized Slurm SPANK plugin [12] for job launching could be implemented. SPANK provides a generic interface to dynamically modify the job launch code in Slurm. Developers only need to compile SPANK plugins against Slurm's spank.h header file [13] and configure the plugstack.conf file, and these plugins will be loaded at runtime during the next job launch. As a result, HTCondor-C is preferred based on our requirements.
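For reference, once such a plugin is compiled, loading it takes a single line in plugstack.conf (the plugin path and name below are hypothetical):

```
# plugstack.conf (sketch; plugin name is made up for illustration)
required /usr/lib64/slurm/spank_htc_filter.so
```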

Job scheduling issues and possible solutions
The next step after job migration is job scheduling and job running. In this step, three issues are confronted: large job quantity, resource sharing and system environment. The following subsections describe the issues and possible solutions.

The issue of large job quantity
HTCondor is designed to schedule High Throughput Computing jobs, which means that HTCondor is good at dealing with a large quantity of single-core jobs. On the contrary, Slurm is suitable for scheduling multi-core jobs with a high degree of parallelism, but not good at handling a large quantity of jobs [14].
To solve this issue, with HTCondor-C adopted, some limits can be set in the job_router daemon of HTCondor, such as the maximum number of jobs allowed to migrate to the Slurm cluster. On the other side, a Slurm SPANK plugin could be developed to choose proper jobs from HTCondor to launch. With this SPANK plugin loaded, during the job selection procedure, if the workload of the Slurm cluster is heavy, the number of jobs to be launched can be limited. As a result, a lighter scheduling burden is placed on Slurm.
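A hedged sketch of such limits in the job router configuration (the route name and numbers are placeholders, not our production values):

```
# HTCondor job router route (illustrative)
JOB_ROUTER_ENTRIES = \
  [ name = "to_slurm"; \
    GridResource = "batch slurm"; \
    MaxJobs = 200; \
    MaxIdleJobs = 20; \
  ]
```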

The issue of resource sharing
Computing resources of the HTCondor cluster and the Slurm cluster are funded by multiple experiments, and the experiments supported by the HTCondor and Slurm clusters are different. When some qualified HTCondor jobs are migrated to the Slurm cluster, the question arises of which resources should be allocated to the jobs submitted from different experiments.
To answer this question, a possible solution is to provide an extra Information System. The Information System aims to provide resource sharing information among multiple experiments. With this information, Slurm is able to allocate resources to the migrated jobs after share permission is checked.
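Purely as an illustration of the idea (the experiment names and share values are made up), the Information System can be thought of as a lookup from experiment to the Slurm share it is permitted to use, consulted before resources are allocated to a migrated job:

```shell
#!/bin/bash
# Toy sketch of the proposed share lookup; real data would come from
# the Information System, not a hard-coded table.
declare -A share_of=(
    [expA]="htc_share_large"
    [expB]="htc_share_small"
)

# Print the share an experiment may use; fail if it has no permission.
check_share() {
    local exp="$1"
    [[ -n "${share_of[$exp]:-}" ]] && echo "${share_of[$exp]}"
}
```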

The issue of system environment
As mentioned in section 4.2, the experiments supported by the HTCondor and Slurm clusters are different. As a result, the system environment requirements to run jobs are different. If HTCondor jobs are migrated to the Slurm cluster and ready to run, the different system environments could be an issue. A possible solution is to adopt container technology to provide a lightweight virtual system environment. To make this work, both Docker and Singularity were investigated, and Singularity is preferred for the moment because it is safer and supported by Slurm [15][16].
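As a sketch of how the migrated payload could be wrapped (the image path and command are assumptions), the Slurm side can launch the job inside a Singularity image that reproduces the HTCondor-side operating system environment:

```shell
#!/bin/bash
# Hypothetical wrapper: run a migrated job's payload inside a
# Singularity container providing the HTCondor-side OS environment.
run_in_container() {
    local image="$1"; shift
    # 'singularity exec' runs the given command inside the image.
    singularity exec "$image" "$@"
}

# Example usage (paths are made up):
#   run_in_container /images/sl7.sif ./job.sh
```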

Conclusion
There are two aims of integrating workload between the HTCondor and Slurm clusters. One is to improve the resource utilization of the Slurm cluster; the other is to reduce the job queuing time of the HTCondor cluster. To demonstrate that the integration is feasible, three ways to bridge the HTCondor and Slurm clusters were investigated, and HTCondor-C is preferred because of its rich customizable scheduling strategies, including HTCondor job-router configuration and Slurm SPANK plugins. To schedule and run HTC jobs on the Slurm cluster, three issues were proposed concerning large job quantity, resource sharing and system environment, and possible solutions are in progress. For the next step, Singularity will be tested and developed more deeply to solve the system environment issue; later a Slurm SPANK plugin will be implemented for the large job quantity issue, and the resource sharing Information System will come at last.

Table 1 :
Key configurations of overlap.

Table 2 :
Key configurations of flocking.

Table 4 :
Comparison of overlap, flocking and HTCondor-C