Remote data access in computational jobs on the ATLAS data grid

This work describes the technique of remote data access from computational jobs on the ATLAS data grid. In comparison to traditional data movement and stage-in approaches, it is well suited for data transfers which are asynchronous with respect to the job execution. Hence, it can be used for the optimization of data access patterns based on various policies. In this study, remote data access is realized with the HTTP and WebDAV protocols, and is investigated in the context of intra- and inter-computing site data transfers. In both cases, the typical scenarios for the application of remote data access are identified. The paper also presents an analysis of parameters influencing the data goodput between heterogeneous storage element - worker node pairs on the grid.


Introduction
The ATLAS Experiment at CERN [1] stores approximately 400 petabytes of data across more than 150 data centers within the Worldwide LHC Computing Grid (WLCG) [2,3]. More than 5000 grid users submit analysis and production jobs. The data is accessed based on data placement, stage-in and remote access techniques. In practice, data placement is typically employed for scheduled inter-storage element data transfers [4], while stage-in and remote access [5][6][7] are used by worker nodes locally within a single computing site. However, other data access profiles, such as direct reading from remote storage elements, or arbitrary combinations of the various access techniques within a single computational campaign, are possible.

Remote data access
Consider a computational job scheduled to run at the computing site S1. The job requires as input some fraction of a file persisted at the remote computing site S2. According to the current data access policies of the ATLAS data grid, the complete file first needs to be placed from the remote storage element to the local one. Then, it can be accessed by the worker node in the stage-in manner. In this section, we refer to this procedure as local access. An alternative is to read the relevant fraction of the file directly from the remote storage element at S2. Here, we refer to this access profile as remote access.
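The remote access profile can be sketched with a plain HTTP Range request, which HTTP/WebDAV storage endpoints typically support. This is an illustrative sketch, not the production middleware: the function names are our own, the byte offsets are hypothetical, and the standard-library `urllib` stands in for whatever HTTP client the grid software actually uses.

```python
import urllib.request


def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header selecting `length` bytes starting at
    `offset` (byte ranges are inclusive on both ends)."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def read_fraction(url: str, offset: int, length: int) -> bytes:
    """Read only the relevant fraction of a remote file, without first
    placing the complete file on the local storage element."""
    req = urllib.request.Request(url, headers=range_header(offset, length))
    with urllib.request.urlopen(req, timeout=60) as resp:
        # 206 Partial Content confirms the server honoured the byte range;
        # a plain 200 would mean the whole file is being returned instead.
        if resp.status != 206:
            raise RuntimeError(f"range request not honoured: HTTP {resp.status}")
        return resp.read()
```

Only the requested byte range crosses the network, which is precisely the difference between the two access profiles when a job needs a small fraction of a large file.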

Goodput break-even point
In this subsection we present hypothetical examples demonstrating the suitability of the local/remote data access profiles in various scenarios. The examples are based on a simplistic grid topology, depicted in figure 1. The grid comprises two sites, S1 and S2. Jobs are scheduled to run at the local site S1, which consists of the storage element SE1 and 40 worker nodes w1, w2, ..., w40. The second site S2 has a storage element SE2 (its workers are neglected). All the input data needed by the computational jobs j1, j2, ..., jn is stored in a single file F1 with a size of 100 GB. F1 is stored in SE2. The sites S1 and S2 are connected over a 3 Gbps WAN link. Each of the storage elements SE1 and SE2 is connected to the WAN link over a 3 Gbps LAN link. Each of the workers w1, w2, ..., w40 is connected to the local storage element SE1 and to the WAN link over a 1 Gbps LAN link. Each of the storage elements SE1 and SE2 has a maximum data throughput of 5 Gbps. Each worker w1, w2, ..., w40 has a maximum data throughput of 1 Gbps. In the demonstrated examples the same effect is reached if the storage element SE2 has a maximum throughput of 3 Gbps while the WAN link has a bandwidth of >= 3 Gbps. The one-way network latency accounts for the end-to-end delay; it is 20 ms between all local machines and 100 ms between all remote machines. The grid employs a reliable data transfer protocol (thus, network latency has a direct impact on the overall data transfer time). For the sake of simplicity, we assume that there is no slow-start phase in the data transfer protocol, and that packet switches do not slow down the transfer process in any way. The packet size is set to 1 MB (header bytes are neglected), the packet loss rate is 1% over WAN and 0.1% over LAN, and during the second attempt all the packets are guaranteed to be delivered. The protocol employs two asynchronous channels: a data channel and a control channel. The traffic needed for the transmission of control messages is neglected. Every time a lost packet is detected and retransmitted by the sender, the transfer time is increased by the network latency. We assume that the maximum storage capacities of all the machines are never reached, and that there are always CPU cores available for the processing. We assume that the storage elements share their maximum data throughput, and the WAN link its bandwidth, equally between all parallel data flows.
In this highly abstracted calculation of the data transfer time [8] we only consider two dominant components: (1) the delay resulting from the first attempt to transfer all the packets; (2) the accumulated latencies resulting from the second attempt to transfer the lost packets. The general formula is

t_{S→D} = d_transfer + l · (F / P) · R_loss

where t_{S→D} is the data transfer time from source to destination, d_transfer is the aforementioned first component, l is the concerned latency, F is the size of the data being transferred, P is the packet size, and R_loss is the loss rate. d_transfer itself is also calculated on a high abstraction level, by

d_transfer = F / T

where T is the end-to-end throughput of the transfer, which is fairly determined considering the amount of parallel data flows and the physical limits of the hardware.
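Under the stated assumptions the model is small enough to write down directly; the helper below is a literal transcription of the two formulas, with all quantities in bits and seconds:

```python
def transfer_time(F, T, l, P, R_loss):
    """Abstracted data transfer time:
        t = d_transfer + l * (F / P) * R_loss,  with  d_transfer = F / T.

    F: data size (bits), T: end-to-end throughput (bits/s),
    l: one-way latency (s), P: packet size (bits), R_loss: loss rate.
    Each of the F/P * R_loss lost packets costs one extra latency on
    its (guaranteed) second delivery attempt."""
    d_transfer = F / T
    retransmission_delay = l * (F / P) * R_loss
    return d_transfer + retransmission_delay
```

For instance, a 10 GB transfer at 1 Gbps over the WAN (100 ms latency, 1 MB packets, 1% loss) yields 80 s of raw transfer plus 10 s of accumulated retransmission latency.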
In the first example, each worker node w1, w2, ..., w40 is assigned a single computational job j1, j2, ..., j40 respectively. Each job j1, j2, ..., j40 requires as input a distinct fraction of the file F1, which is located at the remote storage element SE2. The size of the requested file fraction is varied from 1 to 50 GB in 1 GB steps (each neighboring pair of points in the plot is linearly interpolated). In the second example, a varied number of worker nodes (1-40) is assigned a single computational job each. Each job needs a distinct 10 GB fraction of the file F1 as input. Applying the above formula, we get the transfer times depicted in figure 2. We interpret the results as follows. Data caching by means of data placement minimizes the limiting effects posed by large amounts of parallel data flows on the throughput of network links or storage elements. Remote data access helps to avoid the increased cumulative latency due to the transfer of redundant file fractions.
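As a rough illustration of the break-even effect in the second example (a sketch under the stated toy assumptions, not a reproduction of the exact curves in figure 2), the per-flow throughput is the minimum over the worker NIC, the equally shared WAN link, and the equally shared storage element:

```python
GBIT = 1e9          # bits per second
PACKET = 8e6        # 1 MB packet, in bits


def per_flow_throughput(n_flows, worker_bps, wan_bps, se_bps):
    """Each flow has a private worker NIC but shares the WAN link and the
    storage element throughput equally with the other flows."""
    return min(worker_bps, wan_bps / n_flows, se_bps / n_flows)


def remote_access_time(n, fraction_bits=80e9):
    """n workers each stream a distinct 10 GB fraction of F1 directly
    from SE2 (100 ms WAN latency, 1% first-attempt loss)."""
    T = per_flow_throughput(n, 1 * GBIT, 3 * GBIT, 5 * GBIT)
    return fraction_bits / T + 0.1 * (fraction_bits / PACKET) * 0.01


def local_access_time(n, fraction_bits=80e9, file_bits=800e9):
    """The complete 100 GB file F1 is first placed to SE1 over the WAN,
    then n workers stage in their fractions locally (20 ms, 0.1% loss)."""
    placement = file_bits / (3 * GBIT) + 0.1 * (file_bits / PACKET) * 0.01
    T = per_flow_throughput(n, 1 * GBIT, float("inf"), 5 * GBIT)
    stage_in = fraction_bits / T + 0.02 * (fraction_bits / PACKET) * 0.001
    return placement + stage_in
```

With a single worker, remote access wins because no redundant bytes cross the WAN; with all 40 workers, the shared 3 Gbps WAN link throttles the parallel remote flows and data placement wins, which is the qualitative break-even behavior the two examples illustrate.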
Furthermore, remote data access is beneficial for the following reasons:
• The data is transferred asynchronously with respect to the job execution. Thus, the resource utilization is maximized.
• Such an access methodology can help to bypass limited disk space and quotas imposed by virtual organizations, by streaming data in chunks without persisting it.
• Remote data access has a smaller coordination overhead.

Observations in production
This section presents experiments conducted in the production grid in order to examine the behavior of remote data access. The data access is realized based on the reliable HTTP/WebDAV protocols. First, we examine the throughput bottleneck given a virtual link between a worker node pool and a storage element. For this purpose, we submit a computational job to the worker node pool of an anonymous computing site in Italy, which we denote by WNP_ITALY. Every 30 minutes, 4 threads commence to concurrently read files of various sizes (100MB, 300MB, 1GB and 2GB) from an anonymous storage element located in the UK, denoted by SE_UK. A single thread reads a single file respectively. Once per sampling day, the job is re-submitted to a randomly selected worker node. The sampling day is divided into 2 phases of equal length. During the first phase, 10 other threads within the same job concurrently read other files of various sizes. During the second phase, only the initially mentioned threads are active. The goal is to test whether the group means of the transfer times for the 4 initial files differ among the 2 sampling phases. If this is the case, then the additional traffic imposed during the first sampling phase causes a bottleneck effect on the given virtual link. The resulting time series for the transfer times of the original files are depicted in figure 3. Given a single initial file and a single sampling phase, the estimated [9] average concurrent traffic (AVG_CT) is reported in table 1.
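The measurement loop above can be sketched as follows. This is an illustrative harness, not the production code: `fetch` stands for whatever routine actually consumes a file over HTTP/WebDAV, and is injected so the timing logic stays protocol-agnostic.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_concurrent_reads(urls, fetch):
    """Read all URLs concurrently, one thread per URL, and return a map
    from URL to its wall-clock transfer time in seconds."""

    def timed(url):
        start = time.monotonic()
        fetch(url)  # fully consume the file body
        return url, time.monotonic() - start

    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return dict(pool.map(timed, urls))
```

Invoked every 30 minutes with the 4 file URLs (plus the 10 extra URLs during the first phase), this yields exactly the per-file transfer time series plotted in figure 3.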
To test whether the transfer times differ significantly among the sampling phases [10], we perform the non-parametric two-sided paired Wilcoxon Signed-Rank Test. Specifically, for each original file, we split the sampled transfer times into two groups according to the sampling phase. The null hypothesis of the test states that the two samples are drawn from the same population. The test results are reported in table 2. According to the test results, the null hypothesis is rejected.
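The per-file test can be sketched with SciPy's implementation of the paired signed-rank test; the function name and data layout here are illustrative, and any pair of equally long arrays of per-sample transfer times will do:

```python
from scipy.stats import wilcoxon  # SciPy's paired Wilcoxon signed-rank test


def phases_differ(phase1_times, phase2_times, alpha=0.01):
    """Two-sided paired Wilcoxon Signed-Rank Test on transfer times.

    H0: both samples are drawn from the same population.
    Returns (reject_H0, p_value); reject when p_value < alpha."""
    statistic, p_value = wilcoxon(phase1_times, phase2_times)
    return p_value < alpha, p_value
```

Being rank-based, the test makes no normality assumption about the transfer times, which is why the non-parametric variant is the appropriate choice for these skewed network measurements.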
Further, we investigate whether the detected bottleneck arises on the side of the worker node or of the storage element. The second experiment is performed in the context of other computing sites. The original computational job is submitted to a worker node pool in the USA, WNP_USA. Just like before, once per day the job is resubmitted to a random worker node. Once per 30 minutes, 4 threads within the job concurrently read files of various sizes (100MB, 300MB, 1GB and 2GB) from a storage element in Australia, SE_AUSTRALIA. A single thread reads a single respective file. However, this time there are no additional concurrent threads in the original job. Instead, at every second sampling step two computing jobs running in Australian (WNP_AUSTRALIA) and Italian (WNP_ITALY) worker node pools respectively commence to stream files of various sizes. Thus, the sampling phases alternate at every second time step. The resulting time series are presented in figure 4. Compared to the time series generated by the previous experiment, the transfer times stay relatively constant. However, an interesting observation is that the worker nodes with the MAC address prefix 18:66:DA exhibit a rather unstable behavior.
Given a single original file and a single sampling phase, the estimated average concurrent traffic is reported in table 3. In this case we differentiate between within-job concurrent traffic (AVG_CT1) and cumulative concurrent traffic from the 2 other virtual links (AVG_CT2). Based on these test results, the null hypothesis is clearly confirmed. Thus, we conclude that the bottleneck detected in the first experiment originates in the network I/O of the worker nodes.

Conclusions and Future Work
In this paper we have described the current state of the remote data access methodology in the ATLAS data grid. Furthermore, we have motivated its use for inter-computing site transfers, which is currently not deployed. The presented empirical experiments show that the I/O bottlenecks for remote data access tend to be exclusively on the side of the worker nodes (given practical workloads). Finally, we have detected the unstable throughput of certain network interface cards.
Currently we are working on the following aspects as part of the future work:
• A mathematical formalization and forecasting model for the network throughput of all the data access profiles (remote data access, data placement and stage-in).
• Incorporation of the throughput formalization in a simulation model.
• Evolutionary optimization of hybrid data access profiles using memetic algorithms.

Figure 4. No bottleneck induced by the storage element.

Table 1. Estimated average concurrent traffic within the first experiment.

Table 3. Estimated average concurrent traffic within the second experiment.

We have performed an analogous non-parametric two-sided paired Wilcoxon Signed-Rank Test, with the null hypothesis stating that the transfer times of the individual files do not differ among the sampling phases. The results are depicted in table 4.