Limits of the HTCondor Transfer System

In the past, several scaling tests have been performed on the HTCondor batch system regarding its job scheduling capabilities. In this talk we report on a first set of scalability measurements of the file transfer capabilities of the HTCondor batch system. Motivated by the needs of the GlueX experiment, we evaluate the limits and possible use of HTCondor as a solution to transport the output of jobs back to the submitter as a function of the number of concurrent jobs, the size of the output, and the distance between the submitting node and the computing nodes. We conclude that the HTCondor transfer system satisfies the current GlueX computing demands.


Introduction
GlueX [1] is a high energy photoproduction experiment at Jefferson Lab. It employs a 9 GeV linearly polarized photon beam interacting in a liquid hydrogen target to produce a broad spectrum of hadronic final states. Charged and neutral particles produced as decay products from these interactions are detected in a high-acceptance magnetic spectrometer designed to identify and reconstruct exclusive final states from the lightest mesons up to the charm threshold. The foremost scientific goal of GlueX is to find definitive evidence of hybrid mesons, i.e. quark-antiquark states with excited internal gluonic degrees of freedom. Lattice QCD calculations estimate the masses of the lightest hybrids to be in the range 1.5-2.5 GeV/c². Searches for these states over the past 30 years, mostly with pion beams, have found evidence for resonance behavior in a few isolated channels, but the number and pattern of mass splittings for this exotic new class of hadrons remain unclear. Resolution of this important question demands new experimental data with a complementary probe to independently verify the existence of the reported states and to search for the other members of the expected flavor multiplets. GlueX began physics data collection in February 2017 and is now nearing the completion of its initial exploratory run. Analysis of these data is now underway.
The GlueX collaboration stated that in order to fulfill its physics program, its Monte Carlo simulation production would require three million CPU hours for the 2016/2017 campaign. Given the embarrassingly parallel character of those Monte Carlo simulations, the Grid was picked as one of the possible sources of computing resources for these workflows. The GlueX collaboration had already been using the Open Science Grid [2] to run some of its analysis workflows. However, one of the challenges of the Monte Carlo workflows is retrieving the intermediate products (the output of the jobs) back from the compute nodes for storage and further analysis; these can be up to ten times larger in size than their analysis counterparts.
Since GlueX already had experience using GlideInWMS [3], the question to be answered is whether the HTCondor transfer system [4] can handle this load. In the HEP community, tools like gfal and globus-url-copy have been the preferred choice for sending large output files from grid jobs back to a server running GridFTP [5], Xrootd [6], or some other web-facing interface. These tools have two disadvantages: first, the jobs (or the experiment frameworks) are themselves in charge of sending their output back, so if the stage-out fails the whole job fails and has to be manually resubmitted; second, an additional external service has to be operated by the collaboration. The HTCondor transfer system takes care of retransmitting the output (if there are network problems), and the submit host (Schedd) is the same service already run to submit the jobs, so there is no operational overhead for the experiment. We organize this paper as follows: in Section 2 we quantify the computing infrastructure needs of the Monte Carlo production; in Section 2.1 we translate the quantified Monte Carlo needs into a large-scale computing test and its parameters; finally, we present the results of the test in Section 3.
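As an illustration of this operational simplicity, a submit description can delegate the output stage-out entirely to HTCondor's built-in file transfer. The following is a minimal hypothetical sketch; the executable name, output file name, and queue count are placeholders invented for illustration, not the actual GlueX workflow:

```
# Hypothetical GlueX-style submit description; file names are placeholders.
# HTCondor's built-in file transfer returns the declared output to the
# submit host and retries on transient network failures.
executable              = run_gluex_sim.sh
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_output_files   = sim_events.hddm
log                     = sim.$(Cluster).log
queue 1000
```

No external stage-out client runs inside the job, so a failed transfer does not by itself turn into a failed job that must be resubmitted by hand.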

GlueX Monte Carlo requirements
The first step in testing the capability of the HTCondor transfer system to handle the load produced by the GlueX Monte Carlo simulation workload is to parameterize the workload in terms of the number of cores, the output size per job (simulations are single threaded), and the average running time of a job, because these three parameters single-handedly define the average write rate (W) of output to the submit host, as captured in equation 1. The parameters appropriate for the GlueX needs, measured during the GlueX data challenge of April 2014, are summarized in Table 1. Combining the two, and assuming a worst-case scenario for the number of events per job and at most 20k cores available for the simulations (between GlueX-owned and OSG opportunistic resources), we obtain the worst-case result captured in equation 1. The size of the output of a job is calculated as 30k events per job times 3 kB per event, resulting in 90 MB of output per job; the length of a job is calculated as 9 hours times the number of seconds in an hour.
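Taking equation 1 to be the aggregate output volume per unit time, W = (cores × output per job) / job length (our notation for the three parameters above; the paper's own symbols may differ), the worst-case numbers can be checked with a short calculation:

```python
# Worst-case average write rate W at the submit host, using the
# parameterization of Section 2 (symbol names are ours, not the paper's).
events_per_job = 30_000        # worst-case events per job
event_size_kb  = 3             # kB per event
cores          = 20_000        # GlueX-owned + OSG opportunistic slots
job_length_s   = 9 * 3600      # 9-hour jobs, in seconds

output_per_job_mb = events_per_job * event_size_kb / 1000   # 90 MB per job
w_mb_per_s = cores * output_per_job_mb / job_length_s

print(f"output per job:          {output_per_job_mb:.0f} MB")
print(f"worst-case write rate W: {w_mb_per_s:.1f} MB/s")
```

This reproduces the roughly 55 MB/s worst-case rate used as the baseline in the latency tests of Section 3.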

Scaling Problem
In order to stress the HTCondor transfer system along different dimensions (concurrency, job output size, or job length), we first note from equation 1 that shorter jobs write at a higher rate, and that jobs writing larger files reach the same rate with fewer parallel jobs. For this testing we set up a submit host connected to a GlideInWMS system that used the UCSD CMS T2 sleeper pool. A sleeper pool is a batch system pool in which nodes are oversubscribed with more jobs than the number of physical cores, under the premise that those extra slots will consume a negligible amount of resources (CPU, memory, and disk). This configuration allows us to reach high levels of concurrency at the network level, reducing the chances of many jobs landing on the same node and being bottlenecked by the NIC (Network Interface Controller) of a single compute node, while not disturbing the Tier 2 production duties. The specifications of the testing submit node are included in Table 2.

Scaling Results
The first question we answered was whether HTCondor can take advantage of the network, i.e. given enough concurrency and a perfect network between compute nodes and submit node, can HTCondor utilize the full capabilities of the NIC (i.e. the parameters of Table 3) to transfer data from the compute nodes to the submit node. The positive results are seen in Figure 1. Notice that the tested write rate W is ten times the expected worst-case scenario derived in Section 2, and the achieved rate is even higher than that. This was due to HTCondor performing re-transmissions; however, they happened fast enough that there was no visible loss in performance, i.e. CPU efficiency remained close to 99%. See the next section.

Latency Measurements
Once we knew that the HTCondor transfer system achieved the required performance under ideal network conditions, the next step was to see what happens in the more heterogeneous Grid environment, where compute nodes are accessed via the WAN with latencies of up to 80 ms from the submit point, for example a submit host at Jefferson Lab and a workflow running at UCSD. The metric we study is CPU efficiency loss, defined as the total wall time a job occupied a core for transfers (input and output) over the total wall time the job actually used the core for computing. For this test we fixed the parameters so we could test at the worst-case GlueX write rate (55 MB/s) and at twice that rate, at different latency levels between the compute nodes and the submit host. In our test the submit host and the compute nodes were located in the same computing room, so we added an artificial latency on the submit host interface. We found that CPU efficiency loss increases with latency (i.e. distance matters), but it can be mitigated through the kernel parameter changes included in Table 4. The results of the HTCondor transfer system at several latency points are summarized in Figure 2 (red line: twice the expected write rate W; yellow line: the expected rate; green line: the expected rate after the network kernel tuning).
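One way to see why kernel tuning helps at these latencies is the TCP bandwidth-delay product: a stream can carry at most one window of unacknowledged data per round trip, so untuned socket buffers cap the per-stream rate. The window size and NIC speed below are illustrative assumptions, not the measured values of Tables 3 and 4:

```python
# Illustrative bandwidth-delay-product estimate; the 256 KiB window and
# the 10 Gbit/s NIC speed are assumptions, not values from the paper.
rtt_s = 0.080                       # 80 ms round trip, the worst case tested

window_bytes = 256 * 1024           # an untuned TCP window (assumed)
per_stream_mb_s = window_bytes / rtt_s / 1e6
print(f"max single-stream rate at 80 ms: {per_stream_mb_s:.1f} MB/s")

nic_bytes_s = 10e9 / 8              # 10 Gbit/s NIC (assumed)
needed_mb = nic_bytes_s * rtt_s / 1e6
print(f"window to fill the NIC with one stream: {needed_mb:.0f} MB")
```

Raising the socket buffer limits, as the Table 4 changes do, lets each transfer keep more data in flight per round trip, which is consistent with the mitigation of the observed efficiency loss.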