Using Lustre and Slurm to process Hadoop workloads and extending to the WLCG

The Queen Mary University of London Grid site has investigated the use of its Lustre file system to support Hadoop workflows. Lustre is an open-source, POSIX-compatible, clustered file system often used in high performance computing clusters and often paired with the Slurm batch system. Hadoop is an open-source software framework for distributed storage and processing of data, normally run on dedicated hardware utilising the HDFS file system and the Yarn batch system. Hadoop is an important modern tool for data analytics used by a large range of organisations, including CERN. By using our existing Lustre file system and Slurm batch system, the need for dedicated hardware is removed and only a single platform has to be maintained for data storage and processing. The motivation and benefits of using Hadoop with Lustre and Slurm are presented. The installation, benchmarks, limitations and future plans are discussed. We also investigate using the standard WLCG Grid middleware Cream-CE service to provide a Grid-enabled Hadoop service.


Introduction
The Queen Mary WLCG Tier-2 site has successfully operated a reliable service utilising Lustre and Slurm for several years (more details can be found in [1], [2], [3]). The site has over 5PB of Lustre storage and over 4000 job slots, all connected by a 10Gb/s and 40Gb/s Ethernet network, and supports over 20 High Energy Physics (HEP) and Astronomy projects. The main user of the site is the ATLAS experiment [4].
Lustre [5] is an open-source (GPL), POSIX-compliant, parallel file system used by over half of the world's Top 500 supercomputers. Lustre is made up of three components: one or more Meta Data Servers (MDS) connected to one or more Meta Data Targets (MDT), which store the namespace metadata such as filenames, directories and access permissions; one or more Object Storage Servers (OSS) connected to one or more Object Storage Targets (OST), which store the actual file data; and clients that access the data over the network using POSIX file system mounts. The network is typically either Ethernet or InfiniBand.
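For illustration, a client mounts a Lustre file system with an ordinary mount command pointing at the management/metadata node, after which it behaves as normal POSIX storage; the hostname and file system name below are hypothetical:

    mount -t lustre mds01@tcp0:/lustre01 /mnt/lustre
    df -h /mnt/lustre    # the mounted file system is then used like any local POSIX directory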
The Slurm Workload Manager [6] is a free and open-source job scheduler used by many of the world's supercomputers and computer clusters. At Queen Mary, the batch system is fronted by a Cream Compute Element (Cream-CE) [7], which accepts job submissions from the Grid and interfaces with the local Slurm batch system.

The Problem with Hadoop/Spark
Hadoop is a data distribution and processing framework made up of several tools (Yarn, HDFS, MapReduce) and an ecosystem of related projects (Pig/Hive/Storm/Spark). The Hadoop ecosystem is hugely popular in the commercial world and is under active development. Hadoop clusters are constructed with commodity servers (nodes) with local hard drives to run the HDFS file system. A Hadoop cluster will have at least one "namenode" that controls the resource allocation and many "datanodes" that do the actual data storage and processing. The Hadoop system is designed to tolerate failure of hard drives or nodes by having multiple copies of a data file across different nodes.
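As a small illustration of this replication, standard HDFS commands can request and inspect the number of copies of a file; the path and replication factor below are purely illustrative:

    # copy a file into HDFS, requesting three replicas of each block
    hdfs dfs -D dfs.replication=3 -put input.dat /data/input.dat
    # report which datanodes hold the blocks of the file
    hdfs fsck /data/input.dat -files -blocks -locations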
There has recently been increasing interest in the use of Hadoop/Spark in HEP; however, usage in HEP still remains small. A number of obstacles prevent or hinder users from running Hadoop/Spark at scale:
• Most computing resources are set up as High Performance Computing (HPC) or High Throughput Computing (HTC) clusters. These do not natively run Hadoop/Spark workloads.
• Creating a dedicated Hadoop cluster would take resources, hardware and manpower away from existing HEP computing needs, and there is not yet enough demand to justify this for most providers of computing resources in HEP.
• Many HEP users do not have access to the finances to use a commercial cloud solution which could provide Hadoop/Spark services.
• Private cloud and containerisation may provide a future solution to resource provision; however, their availability to the HEP community still remains low.
Many providers of computing resources to HEP now also have to serve other communities beyond HEP which make greater use of Hadoop/Spark. It would be beneficial to serve these communities without permanently splitting resources.

A Solution With Magpie
The Magpie project provides a way to integrate Hadoop and Spark into existing HPC and HTC infrastructures. It supports running over any generic network file system (including Lustre/NFS/HDFS) and several schedulers/resource managers (such as Slurm and LSF). Magpie is primarily developed by Al Chu at LLNL (https://github.com/LLNL/magpie) and is actively maintained and developed. It also supports a large part of the Hadoop ecosystem (figure 1) and a very wide range of software versions.
Magpie accomplishes this by instantiating a temporary Hadoop/Spark cluster within the context of a batch job, rather than on a persistent, dedicated cluster. The batch job consumes job slots to create a Hadoop cluster spanning several nodes. If required, a transient HDFS file system can be created using local scratch space on the nodes or a network file system. Data can then be copied into the cluster, processed, and the results copied back to persistent storage. At the end of the job, all traces of the cluster are destroyed, at which point the job slots become available for other types of workload.
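To illustrate how this looks in practice, a minimal sketch of a Slurm batch script based on Magpie's submission templates is shown below. The exact environment variable names and values depend on the Magpie and Hadoop versions deployed, and all paths shown are hypothetical:

    #!/bin/bash
    #SBATCH --nodes=4                 # one master (namenode) plus three workers (datanodes)
    #SBATCH --time=04:00:00
    #SBATCH --job-name=magpie-hadoop

    # Locations of the Magpie and Hadoop installations (e.g. exported read-only via CVMFS)
    export MAGPIE_SCRIPTS_HOME=/cvmfs/example.org/magpie
    export HADOOP_SETUP=yes
    export HADOOP_SETUP_TYPE=MR                     # MapReduce/Yarn cluster
    export HADOOP_VERSION=2.7.3
    export HADOOP_HOME=/cvmfs/example.org/hadoop-${HADOOP_VERSION}
    export HADOOP_LOCAL_DIR=/tmp/${USER}/hadoop     # per-node scratch for logs and config

    # Build the transient HDFS on top of Lustre rather than on local scratch disks
    export HADOOP_FILESYSTEM_MODE="hdfsoverlustre"
    export HADOOP_HDFSOVERLUSTRE_PATH=/mnt/lustre/${USER}/hdfsoverlustre

    # User script to run once the temporary cluster is up (copy data in, process, copy out)
    export HADOOP_MODE="script"
    export HADOOP_SCRIPT_PATH=${HOME}/magpie_hadoop_script.sh

    # Hand over to Magpie, which starts the cluster, runs the script and tears everything down
    srun --no-kill ${MAGPIE_SCRIPTS_HOME}/magpie-run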

Setup and Use
The following outlines a typical recipe for running a Hadoop job on the Queen Mary cluster via the Grid using a Cream-CE.
1. Hadoop and Magpie software must be available on every node in the cluster. These directories will not be written to, so they can be provided via a read-only file system such as CVMFS.
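For illustration, a Grid submission of such a job through the Cream-CE might use a JDL file along the following lines; the wrapper script, sandbox file names, CE hostname and queue below are hypothetical:

    [
      Executable    = "magpie-wrapper.sh";   // sets the Magpie environment and launches the Magpie run scripts
      InputSandbox  = {"magpie-wrapper.sh", "magpie_hadoop_script.sh"};
      StdOutput     = "hadoop.out";
      StdError      = "hadoop.err";
      OutputSandbox = {"hadoop.out", "hadoop.err"};
    ]

which would then be submitted to the site's Cream-CE, for example with:

    glite-ce-job-submit -a -r ce01.example.ac.uk:8443/cream-slurm-lcg_long hadoop-job.jdl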

Optimisations
An important choice when creating an HDFS file system is whether to create it on local scratch disks or on a network file system. A locally attached hard drive should have high bandwidth to jobs running on the same node. A network file system can aggregate the performance of hundreds of hard disks to overcome the penalty of remote, network-accessed storage. Two sets of benchmarks were used to investigate the performance of local versus remote storage options. The benchmarks were carried out with three dedicated nodes, each with 40 job slots and three 1TB disks. Nodes are connected to the network with 10Gb/s Ethernet. Results with Lustre use the existing production Lustre storage system. The first test uses IOzone [11] in cluster mode, with multiple threads on each node either writing to different directories on Lustre or to different disks on the local node. The Lustre performance is consistently better than that for local disks (figure 2).
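As an indication of how the IOzone runs were driven, a representative cluster-mode invocation is sketched below; the thread count matches 40 slots on each of the three nodes, while the file and record sizes and the client file contents are illustrative:

    # clients.txt holds one line per thread: "hostname  working-directory  path-to-iozone"
    iozone -+m clients.txt -t 120 -s 1g -r 1m -i 0 -i 1   # -i 0 write test, -i 1 read test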
Hadoop DFSIO is a standard benchmark program, shipped with Hadoop, for measuring IO performance. A Hadoop cluster was set up with one namenode and two datanodes, each datanode having three 1TB hard drives. The HDFS over Lustre performance is consistently better than that for HDFS using local disks (figure 3).
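TestDFSIO is launched from the MapReduce jobclient tests jar shipped with Hadoop; a typical pair of write and read runs of the kind used here (file counts, sizes and the jar path vary with the Hadoop version) looks like:

    hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
        TestDFSIO -write -nrFiles 40 -size 1GB
    hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
        TestDFSIO -read -nrFiles 40 -size 1GB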
In both cases, running over Lustre is quicker than using local disks. While increasing the number of nodes, and hence the number of local disks, would improve the performance of HDFS on local storage, the IOzone test already shows higher throughput with Lustre than with three nodes using three local disks per node.
Figure 3: Hadoop DFSIO benchmark using one namenode and two datanodes; local disks are made up of three 1TB hard drives per node. Lustre uses the existing Lustre storage system.

Limitations
A number of limitations of the present service have been identified that currently prevent a full-scale production service at Queen Mary.
• A user submitting a job must know local details about the site they are using (e.g. mount points, software paths, software versions) so that they can provide the magpie_hadoop_script.sh at job submission.
• At present only whole-node requests work correctly; requesting less than a whole node ends up creating one datanode per core on each node, causing a significant performance impact (see the sketch after this list).
• The service only provides a transient HDFS file system; as a result, data has to be copied in each time a Hadoop job starts, reducing job efficiency.
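As a hedged illustration of the whole-node workaround, the relevant batch and Grid directives might look as follows; the node count is illustrative, and support for the JDL whole-node attributes depends on the CREAM version deployed:

    # Slurm: request complete nodes so that Magpie starts one datanode per node
    #SBATCH --nodes=3
    #SBATCH --exclusive

    # Approximate CREAM JDL equivalent when submitting through the Grid
    WholeNodes = true;
    HostNumber = 3;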

Conclusions
A demonstration has been made of how to run Hadoop jobs on an HPC/HTC cluster, as typically used in HEP computing, using a Cream-CE for job submission. The service could be expanded to include more of the software in the Hadoop ecosystem that is supported by the Magpie project. Such a service may help bridge the gap in available resources for Hadoop/Spark workloads until demand makes building dedicated clusters feasible. A number of limitations with the present configuration were identified, but these should not prevent the creation of a production Hadoop/Spark service if demand develops.