Best Practices in Accessing Tape-Resident Data in HPSS*

Tape is an excellent choice for archival storage because of its capacity, cost per GB, and long retention intervals, but its main drawback is slow access time, which follows from the sequential nature of the medium. Modern enterprise tape drives now support Recommended Access Order (RAO), which is designed to reduce data recall times. BNL SDCC's mass storage system currently holds more than 100 PB of data on tape, managed by HPSS. HPSS version 7.5.1 introduced a new feature called “Tape Ordered Recall” (TOR), which supports both RAO and non-RAO drives. File access performance can increase by 30% to 60% over random file access. Prior to HPSS 7.5.1, we used scheduling software developed in-house, known as ERADAT. ERADAT accesses files in order of their logical position on tape, and it has performed well over the past decade of use at BNL. In this paper we present a series of test results and compare the performance of TOR and ERADAT under different configurations, to show how effective TOR (RAO) and ERADAT are and to identify the best approach to recalling data from SDCC's tape storage.


Effectively Using Tape Technology
The tape storage system at the Brookhaven National Laboratory [1] Scientific Data and Computing Center (SDCC) [2] serves the scientific experiments at RHIC [3] and the LHC [4] (CERN, Geneva). The volume of experimental data has grown rapidly, and we currently hold about 150 PB of data in tape storage. We have put considerable effort into how data is written to tape and into optimizing data mining and data production workflows from a production account perspective, taking into account the time sequence and ordering of files on tape.
Because the volume of our experimental data is projected to keep growing very rapidly, the demands on tape storage are becoming more challenging. We therefore need to examine the architectural design of our tape storage in order to optimize the data archival system.
We have a long history of using LTO [5] technology, but other new drives and media are worth evaluating as well. In particular, Recommended Access Order (RAO) is one of the new features we need to evaluate.

Software
BNL's archival storage system is managed by the hierarchical storage management (HSM) software HPSS [6]. HPSS is developed by IBM in collaboration with several DOE national laboratories [7].
In addition to HPSS, we run a scheduler called Efficient Retrieval and Access to Data Archived on Tape (ERADAT) [8]. ERADAT is a file retrieval scheduler built on HPSS API calls, and it is the interface between the user and HPSS. As shown in Fig 1, ERADAT aggregates all staging requests by tape cartridge and then sorts them by logical file offset, so that all requests for the same tape can be read in a single pass, which reduces redundant tape mounts. ERADAT uses linear offset ordering because this is how an LTO drive accesses data on tape.
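The grouping and ordering step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ERADAT's code: the StageRequest type, its field names, and the tape labels and paths are hypothetical, and a real scheduler would obtain tape and offset metadata from HPSS.

```python
from collections import defaultdict
from typing import NamedTuple


class StageRequest(NamedTuple):
    """Hypothetical record for one staging request."""
    path: str    # file name in the archive
    tape: str    # label of the cartridge that holds the file
    offset: int  # logical offset of the file on that tape


def order_requests(requests):
    """Group requests by tape, then sort each group by logical offset."""
    by_tape = defaultdict(list)
    for req in requests:
        by_tape[req.tape].append(req)
    # One sorted list per cartridge, so each tape is mounted once
    # and read in a single forward pass.
    return {tape: sorted(reqs, key=lambda r: r.offset)
            for tape, reqs in by_tape.items()}


requests = [
    StageRequest("/a/f3", "T00002", 120),
    StageRequest("/a/f1", "T00001", 900),
    StageRequest("/a/f2", "T00001", 15),
    StageRequest("/a/f4", "T00002", 7),
]
plan = order_requests(requests)
for tape, reqs in plan.items():
    print(tape, [r.path for r in reqs])
```

Each cartridge appears once in the plan, and its files are listed in increasing offset order regardless of the order in which users submitted them.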
Our goal is to find the method best suited to our production environment. We simulated the production system as closely as possible, so we evaluated the tape technologies using HPSS, with ERADAT as the interface. All test data were taken from our production data.

Hardware
We evaluated 3 kinds of tape technology:
• LTO-7
• T10K-D
• TS1150 with JD media
We used a similar disk buffer for all 3 tape technologies.

Terms and settings
RAO: Recommended Access Order (RAO), a new feature supported by enterprise drives such as the T10K-D and TS11xx series. When enterprise tapes are accessed with TOR turned on, HPSS stages files in the order recommended by RAO.
Offset order: accessing data sequentially, based on logical offset on the tape.

HPSS Staging modes:
Default: TOR (Tape Ordered Recall) is enabled by default. HPSS uses the drive's built-in RAO feature (if the drive provides it) or linear offset ordering (if RAO is not available) to schedule and submit the staging requests.
Noschedule: TOR is disabled. Staging requests are not ordered before they are submitted to HPSS.

Staging coverage:
100% stage: All files on the tape are recalled.
50% stage: Half of the files on the tape are recalled with the even spacing method, i.e. every other file is retrieved.
10% stage: 10% of the files on the tape are recalled with the even spacing method.
We believe even spacing gives a good reference baseline, and it may also represent the worst case. Fig 2 illustrates a tape with 2 wraps and 40 files in total. To stage only 10% of the files from this tape, we select 4 sample files that are 10 files apart from each other. Besides even spacing, we also ran a few tests with random spacing.
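The even spacing selection can be sketched as a constant-stride slice over the list of files in tape order. The function name, the file names, and the choice to start at the first file are assumptions made for illustration; the 40-file tape matches the example in Fig 2.

```python
def evenly_spaced(files, percent):
    """Pick roughly `percent` % of `files` at a constant stride."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    stride = round(100 / percent)  # 10% -> every 10th file, 50% -> every 2nd
    return files[::stride]         # assumption: selection starts at the first file


# Hypothetical tape holding 40 files, listed in tape order.
files = [f"file_{i:02d}" for i in range(40)]

sample_10 = evenly_spaced(files, 10)
sample_50 = evenly_spaced(files, 50)
print(sample_10)
print(len(sample_50))
```

With 40 files, the 10% case selects 4 files that are 10 files apart, and the 50% case selects every other file.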

Recall request submission:
ERADAT recall: The ERADAT tool sorts the recall requests by each file's logical position before submitting jobs to HPSS for staging. The batch submission size is 3, i.e. 3 files are submitted to HPSS for staging concurrently. TOR is disabled, i.e. the job runs in noschedule mode.

Test Data
We used 3 sets of test data: large files, small files, and small files with aggregation.
Large Files: 10 GB per file, not aggregated. A dedicated tape was filled with copies of the same 10 GB file.
Aggregated Small Files: 1 GB files migrated to a dedicated tape with the aggregation policy set as follows: maximum file size 2 GB, maximum number of files in an aggregate 30.
Non-aggregated Small Files: 1 GB files migrated to a dedicated tape without aggregation.
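To make the effect of the two policy limits concrete, the following is a deliberately simplified model of how files might be grouped into aggregates. It is not HPSS's migration logic: the function name and the rule that an oversized file is written on its own are assumptions, and only the two limits quoted above (2 GB maximum file size, 30 files per aggregate) come from the test setup.

```python
def build_aggregates(sizes_gb, max_file_gb=2.0, max_files=30):
    """Group file sizes into aggregates under a simplified policy model."""
    groups = []
    current = []
    for size in sizes_gb:
        if size > max_file_gb:
            # Assumption: a file above the size limit is not aggregated.
            groups.append([size])
            continue
        current.append(size)
        if len(current) == max_files:
            groups.append(current)  # aggregate is full, start a new one
            current = []
    if current:
        groups.append(current)      # final, partially filled aggregate
    return groups


small = build_aggregates([1.0] * 70)  # seventy 1 GB files
large = build_aggregates([10.0] * 3)  # three 10 GB files
print([len(g) for g in small])
print([len(g) for g in large])
```

Under this model the 1 GB test files are packed 30 to an aggregate, while the 10 GB files exceed the size limit and are each written individually.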

Test Methods
All tests timed staging only: files were read from tape and written to disk, with no network data transfer involved.

TOR disabled
Submit 3 requests to HPSS. When a file completes, submit another one. When file A completes, file B is activated immediately because B is already queued in HPSS. While B is being read from tape, file C is queued in HPSS. The figure illustrating this file submission logic is taken from "Efficient Access to Massive Amounts of Tape-Resident Data" [9]. Typically, 2 threads are enough. Since LTO-7 and TS1150 are much faster than LTO-6, we increased the queue depth to 3. Fig 3 illustrates a 3-thread buffer: while File A is being staged, File C is waiting in the queue. As soon as File A finishes, File C is next in line.

Fig 3 Buffer queue
In Fig 4, when File A is returned, File C should already be staging; thread 1 updates the status of File A and then submits File F. With LTO-7 at 300 MB/s, we use a 3-thread model to reduce latency. We recommend not over-allocating the buffer, since doing so does not help and only wastes thread resources.
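The fixed-depth submission logic can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions, not the production scheduler: it assumes requests complete in the order they were queued, it replaces real HPSS calls with log entries, and the function name and file labels are hypothetical.

```python
from collections import deque


def simulate_submission(files, depth=3):
    """Keep up to `depth` requests queued; return the ordered event log."""
    pending = deque(files)  # sorted requests that are not yet submitted
    in_flight = deque()     # requests currently queued in the storage system
    log = []

    # Fill the window before the first read starts.
    while pending and len(in_flight) < depth:
        name = pending.popleft()
        in_flight.append(name)
        log.append(("submit", name))

    # Each completion frees one slot, which is refilled immediately.
    while in_flight:
        done = in_flight.popleft()  # assumption: completion in queue order
        log.append(("complete", done))
        if pending:
            name = pending.popleft()
            in_flight.append(name)
            log.append(("submit", name))
    return log


log = simulate_submission(["A", "B", "C", "D", "E", "F"], depth=3)
for event, name in log:
    print(event, name)
```

The depth parameter is the only tuning knob. A deeper queue keeps more requests outstanding, but as noted above, over-allocating it brings no benefit once the drive always has the next request ready.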

TOR Enabled
Submit 70 requests to HPSS in bulk. When a file completes, a new file is submitted so that 70 requests per tape remain queued in HPSS.

Large File -10 GB
When staging large files, we observed some performance improvement with TOR. However, the gain became less visible as more files were staged from the tape. Table 1 shows the performance difference when using TOR.

Small File 1 GB
When staging small files, we observed a large performance decrease with TOR. See Table 2.

Small File 1 GB, Aggregated
When staging aggregated small files, we observed a performance decrease with TOR. See Table 3.

Large File -10 GB
We observed a significant performance gain when staging large files. However, the gain shrank as the number of files increased. Table 4 shows a gain of nearly 60% when staging 10% of the files.

Small File -1 GB
We also observed a performance gain when staging smaller files at 10% coverage. However, the gain shrank as the number of files increased. Table 5 shows a 43.26% gain when staging 10% of the files, but the gain became negative when staging more than 50% of the files.

Small File -1 GB Aggregated
We were surprised to see a large performance drop when using RAO to stage aggregated files in HPSS. Table 6 shows poor performance in all scenarios.

Random Spacing
We also tested staging with random spacing. The source file list was shuffled and 10% of the files were chosen at random with the Linux shuf command, for example: shuf -n 80 file_list.
As expected, random spacing gives much better performance than even spacing, with RAO both on and off. See Table 7 and Table 8 for details. Also as expected, RAO performed very poorly with HPSS small file aggregation compared to linear offset ordering, as shown in Table 9.
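For readers who want to reproduce the random selection without the shuf command, a rough Python equivalent is sketched below. The function name, the file names, and the 800-file list are hypothetical (800 is chosen only so that 10% matches the count of 80 in the shuf example), and the optional seed is an addition that makes a test run repeatable.

```python
import random


def random_subset(files, fraction=0.10, seed=None):
    """Return a random sample of about `fraction` of `files`, in random order."""
    rng = random.Random(seed)  # private generator, global state is untouched
    count = max(1, round(len(files) * fraction))
    return rng.sample(files, count)  # sampling without replacement


# Hypothetical list of the files stored on one tape.
files = [f"file_{i:03d}" for i in range(800)]

subset = random_subset(files, 0.10, seed=1)
print(len(subset))
```

Like shuf -n, random.sample draws without replacement, so no file is requested twice.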

RAO and HPSS small file-aggregated block
We discussed the small file aggregation problem with the IBM HPSS team, and we believe it is caused by auto-seek when RAO is turned on. With auto-seek enabled, the read head automatically moves to the next block as soon as a file has been read, and HPSS then has to pull it back to the beginning of the current block and seek to the next file. In the figure, green represents a single aggregated data block; with auto-seek in effect, the read head automatically jumps to the beginning of the next block. According to IBM, this bug will be fixed in the next release.
When auto-seek is off, the read head can simply seek to the next position from where it stopped.

Fig 6
With auto-seek off, we expect better performance for aggregated small-file blocks.

Best Staging Practice
RAO may be a short-term solution for enterprise drives such as the IBM TS11xx. It works well when staging a small amount of data. The best long-term approach to using tape is to stage data in bulk, 50% or more of a tape per mount. Enterprise media cost more than LTO in dollars per TB, but the performance gain appears to be very positive. However, since we are an LTO user, we cannot yet take advantage of RAO. Our best staging practice is to increase the size of bulk requests in order to gain more performance.