dCache: from Resilience to Quality of Service

A major goal of future dCache development will be to allow users to define file Quality of Service (QoS) in a more flexible way than currently available. This will mean implementing what might be called a QoS rule engine responsible for registering and managing time-bound QoS transitions for files or storage units. In anticipation of this extension to existing dCache capabilities, the Resilience service, which maintains on-disk replica state, needs to undergo both structural modification and generalization. This paper describes ongoing work to transform Resilience into the new architecture which will eventually support a more broadly defined file QoS.


Introduction
dCache [1,2] is an open-source distributed storage system, designed and proven to scale dynamically to hundreds of petabytes in capacity. The system can be configured as a stand-alone disk-only system or in combination with an arbitrary tape storage system. In the latter configuration, dCache plays the role of a distributed disk buffer that hides from the end user the inefficiencies associated with the high latency and sequential nature of tape access. Additionally, dCache can be configured as a multi-tier storage solution incorporating storage devices of varying throughput capacities, with seamless migration of data between the tiers. dCache has been used predominantly by the scientific community to store data for research in a wide variety of disciplines, performed by groups ranging from a few individuals to international collaborations of many thousands of members, each with different workflows, access patterns, capacity and data-preservation requirements, resources and budgets. There can therefore be considerable differences in scientists' desires and expectations about how their data is stored and made available for analysis. These expectations may concern durability (likelihood of data loss), total bandwidth (aggregated over all clients), bandwidth available to a single client, access latency, or some combination of these factors.
Some expectations may be based on intrinsic properties of the data; for example, raw observational data may be extremely precious because it is impossible to reproduce. The scientists may accept additional storage costs or increased latency if this reduces the likelihood of data loss. Other expectations may be based on current activity; for example, if a particular set of files will be read by many concurrent analysis jobs, then storing multiple copies of these files allows dCache to provide sufficient aggregate bandwidth to satisfy all clients. Similarly, if a small amount of data is expected to provide input to an I/O-bound analysis job, then it may be useful for that data to be stored on specialized, low-latency media.
Exposing all possible storage options would overwhelm users. It is therefore desirable to offer specific, fixed data-placement strategies that provide particular storage and performance characteristics. These different strategies are called QoS (Quality of Service) Classes [3]. By allowing scientists to choose, and later modify, the QoS Class with which their data is stored, a storage system gives them the ability to achieve the optimal storage strategy within the available storage capacity.

The dCache Resilience Service as a QoS Service
"Resilience" [henceforward referred to in capitals without quotation marks] is a dCache subsystem that achieves data durability by maintaining permanent disk replicas independently of the presence of a back-end or tertiary storage system [4]. As implemented, Resilience relies on a partitioning of storage in such a way that files, pools and pool groups are either "resilient" or not, with the service only being responsible for handling the former.
With the advent of interest in providing general quality-of-service management, it was quickly understood that file resilience comprised a subset of the associated and more broadly defined objectives. In design discussions of what will henceforward be referred to as the "QoS Engine", reusing as much of Resilience as possible was thus held to be desirable; there were, however, considerable obstacles in the way. A significant amount of Resilience's architecture needed to be re-factored in order to layer the new semantics concerning file status and QoS requirements on top of the already implemented functionality.
Although the specification of QoS in dCache remains an unsettled area of ongoing investigation and deliberation, it has nevertheless been agreed that, in advance of implementing any new requirements layer, what exists in Resilience must be reworked for easy adaptation.
The following discussion will demonstrate the changes we are making to Resilience in order to transform it into a set of QoS components supporting the QoS Engine. All of the previous Resilience functionality will be retained in this transformation, and after sufficient trial in a production environment, the former Resilience service will be deprecated and replaced by the new QoS Engine. At present, the prototype is being reviewed for inclusion in the dCache code base.

Architectural Changes to Resilience and QoS Component Interactions
Since the original purpose of Resilience was narrowly defined, there was no need for a clear separation into components. One of the original considerations for keeping much of the functionality together was, in fact, a concern over efficiency; in particular, we wished to minimize inter-service communication via messaging. In the course of production use, however, it was discovered that some revisions to the original communication model (particularly a significant increase in the number of messages sent to the pools) were necessary in order to guarantee that the information Resilience obtained from the namespace was actually consistent with that on the pools. There were subtle timing issues which could, under exceptional conditions, lead to incorrect replica caching. Now that we have had to enact more aggressive verification via messaging, the additional communication between separate components becomes less significant, masked by the traffic to each pool in the pool group during any given file update. The implicit component layers in Resilience have thus been teased apart, based on the following functions (sketched in code after this list):
1. determining the number of replicas required;
2. verifying how many viable replicas exist in the current state of the system;
3. making the adjustments necessary to meet the requirements;
4. reacting to changes in pool status and periodically checking for consistency.
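As referenced above, the following minimal Java sketch shows how these four functions might be expressed as independent interfaces. All type and method names here are illustrative assumptions, not the actual dCache API.

```java
// Illustrative only: the names below are hypothetical, not the dCache API.
import java.util.List;

// A file's QoS requirements (function 1) and its observed state (function 2).
record FileQoSRequirements(int requiredDiskReplicas, boolean requiredOnTape) {}
record FileQoSStatus(List<String> diskLocations, boolean onTape) {}

interface QoSRequirementsProvider {
    /** 1. Determine the requirements (e.g. replica count) for a file. */
    FileQoSRequirements requirementsOf(String pnfsId);
}

interface QoSVerifier {
    /** 2. Check how many viable replicas exist and what is still missing. */
    boolean satisfies(FileQoSStatus current, FileQoSRequirements required);
}

interface QoSAdjuster {
    /** 3. Carry out a single adjustment (copy, set sticky, flush, stage). */
    void adjust(String pnfsId, String action);
}

interface QoSScanner {
    /** 4. React to pool status changes and run periodic consistency scans. */
    void poolStatusChanged(String pool, boolean up);
}
```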
For the QoS Engine, at least the first of these functions needs to be independent of the rest of the mechanism that maintains a file's QoS requirements or transitions it from one set of requirements to another. Optimal design, however, also argues for the loose coupling of all four components.
In our 2017 paper [4], we presented the architecture of the Resilience service (figures 1 and 2).
The left-hand column of the diagram in figure 2 represents the classes (handlers and map) dedicated to processing incoming updates for replica consistency. The right-hand column represents the classes (handler and map) responsible for intercepting pool status changes or for periodic scheduling of pool scans. The middle column represents the data components used by Resilience to determine file and pool state.
For the QoS Engine, we have split out the implicit resilience functions into explicit components and interfaces (the first function named above has been further subdivided, for reasons made clear below). It is important to note that the responsibilities of each component now extend to the entire dCache namespace, not just to files that have been marked as resilient by their storage requirements and actual location within designated pool groups. Note also that Resilience did not trigger the migration of a file to an HSM-backed pool for flushing, because it was not intended to handle the dynamic transitioning of a file from one QoS definition to another.
Figure 3 gives an overview of how the basic Resilience components have been remapped into these QoS roles. As may be apparent from this diagram, only the message receiver and the pool scanning operations converted directly to the new components; otherwise, some effort was required to split apart internally the left and center stacks of the original diagram. In particular, the FileOperationHandler and FileOperationMap originally managed both replica verification and adjustment; in addition, elements of the handler performed triage on the basis of file requirements, which the handler also fetched from the embedded data map and the namespace.
Figure 4 illustrates the interactions between the newly defined layers. We have grouped the first two components into what we are calling the "QoS Engine" because it serves as the head or entry-point service. Each of the four components can be run as a separate dCache service, with the usual flexibility as to distribution over domains and nodes. We have also provided a standalone version of the entire stack, in which all components are plugged into one another in the same JVM, avoiding message passing. This can be used on systems where horizontal scaling is less important but hardware resources are limited (the standalone service requires more memory, of course).
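As an illustration of this deployment flexibility, the following hypothetical dCache layout fragments show the two extremes. The service names used here (qos-engine, qos-verifier, qos-adjuster, qos-scanner, and the standalone qos) are assumptions for the purpose of the example and may differ in the released code.

```
# Hypothetical layout: one component per domain, distributable over nodes.
[qosEngineDomain]
[qosEngineDomain/qos-engine]

[qosVerifierDomain]
[qosVerifierDomain/qos-verifier]

[qosAdjusterDomain]
[qosAdjusterDomain/qos-adjuster]

[qosScannerDomain]
[qosScannerDomain/qos-scanner]

# Alternatively, the standalone variant: the entire stack in a single JVM,
# with the components plugged into each other and no message passing.
[qosDomain]
[qosDomain/qos]
```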
Just as the heart of Resilience lay in the components which queued the file operations and determined how much additional work was necessary to fulfill their replica requirements, the core controlling component in the QoS Engine is the verifier, to which all the other components report. It is the verifier which decides when a QoS request has completed or failed; it passes off work to the adjuster one action at a time. This re-factoring also afforded us the opportunity to revise the verifier's state machine in order to allow for greater compatibility with manual migration during pool draining (which was somewhat problematic in Resilience).
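The following is a minimal sketch of this control flow, assuming a simple replica-count comparison; the state names, retry logic and methods are illustrative and do not reproduce the prototype's actual state machine.

```java
// Hypothetical sketch: the verifier decides completion or failure and
// hands the adjuster exactly one action at a time.
enum VerifyState { RUNNING, WAITING_ON_ADJUSTER, DONE, FAILED }

final class VerifyOperation {
    private static final int MAX_RETRIES = 3;   // illustrative retry limit
    private VerifyState state = VerifyState.RUNNING;

    /** Invoked on a new request and again after each adjustment completes. */
    void verify(int viableReplicas, int requiredReplicas, int failedAttempts) {
        if (failedAttempts > MAX_RETRIES) {
            state = VerifyState.FAILED;           // report failure upstream
        } else if (viableReplicas == requiredReplicas) {
            state = VerifyState.DONE;             // the QoS request has completed
        } else if (viableReplicas < requiredReplicas) {
            submit("COPY");                       // deficit: one copy at a time
        } else {
            submit("UNSET_STICKY");               // excess permanent replica
        }
    }

    private void submit(String action) {
        state = VerifyState.WAITING_ON_ADJUSTER;
        // hand a single adjustment task to the adjuster, which reports back
    }
}
```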
The time-line diagram in figure 5 illustrates the processing of new files or modification requests:
1. receive message (cache update or QoS modification);
2. check the file requirements;
3. request verification;
4. verify the status of the file (how many replicas, on tape, etc.) on the pools, re-contacting the provider on each iteration.

A crucial aspect of the separation into components is that it permits an easier redefinition of the QoS Engine "peripherals", that is, the adjuster and provider layers.
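Returning to the time line above, here is a compact sketch of the first three steps, assuming hypothetical nested interfaces for the provider and verifier; none of these names are taken from the actual code.

```java
// Illustrative sketch of steps 1-3: receive an update, fetch the file's
// requirements from the provider, then hand off to the verifier.
final class QoSEngine {
    interface Provider { int requiredReplicas(String pnfsId); }
    interface Verifier { void requestVerification(String pnfsId, int required); }

    private final Provider provider;
    private final Verifier verifier;

    QoSEngine(Provider provider, Verifier verifier) {
        this.provider = provider;
        this.verifier = verifier;
    }

    /** Step 1: entry point for cache updates and QoS modification requests. */
    void messageArrived(String pnfsId) {
        int required = provider.requiredReplicas(pnfsId); // step 2
        verifier.requestVerification(pnfsId, required);   // step 3
    }
}
```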

QoS Adjuster
This service encapsulates a number of tasks which call out to the pools directly to change a replica's metadata (such as the "sticky" bit designating its permanence on disk), copy the replica or migrate it for flushing, or request staging through the PinManager. The tasks are rather simple and rely on other parts of dCache to do most of the work. Nevertheless, they contain some logic (such as synchronous waiting for a reply from the PinManager) which may in the future need to be modified or shifted to other areas of dCache for the sake of better consistency and code reuse. Unlike in Resilience, this functionality is now easily modified because it has been isolated.
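To give a flavor of this isolation, here is a hypothetical dispatch over the adjustment types named above; the enum values and helper methods are illustrative assumptions rather than the prototype's actual classes.

```java
// Illustrative sketch of an adjuster task; names are not the dCache API.
enum Action { SET_STICKY, UNSET_STICKY, COPY, FLUSH, STAGE }

final class AdjusterTask {
    private final Action action;
    private final String pnfsId;

    AdjusterTask(Action action, String pnfsId) {
        this.action = action;
        this.pnfsId = pnfsId;
    }

    void run() {
        switch (action) {
            case SET_STICKY, UNSET_STICKY -> messagePool(); // change replica metadata
            case COPY, FLUSH -> startMigration();           // delegate to migration machinery
            case STAGE -> pinAndWait();                     // synchronous PinManager request
        }
    }

    private void messagePool()    { /* send a sticky-bit message to the pool */ }
    private void startMigration() { /* start a copy/flush migration task */ }
    private void pinAndWait()     { /* block until the PinManager replies */ }
}
```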

QoS Provider
The isolation of this function into a distinct API was by far the principal motivation behind the redesign. In the prototype version under review, that API is implemented using logic nearly identical to that of Resilience: a file's requirements are defined by a combination of namespace attributes (AccessLatency and RetentionPolicy) and membership in a storage group, which expresses the required number of replicas and their distribution. Consequently, the QoS states available in the prototype are those defined in table 1.
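As a rough illustration of that logic, the following sketch derives requirements from the two namespace attributes plus a per-storage-unit replica count. The class, record and method names are hypothetical, and the actual prototype handles many more cases.

```java
// Minimal sketch, not the actual dCache code: derive a file's QoS
// requirements from AccessLatency/RetentionPolicy and its storage unit.
final class PrototypeRequirementsProvider {
    record Requirements(int permanentDiskReplicas, boolean tapeCopy) {}

    /**
     * @param accessLatency   "ONLINE" or "NEARLINE"
     * @param retentionPolicy "CUSTODIAL" or "REPLICA"
     * @param unitReplicas    replica count configured on the storage unit
     */
    Requirements requirementsFor(String accessLatency, String retentionPolicy,
                                 int unitReplicas) {
        boolean tape = "CUSTODIAL".equals(retentionPolicy);   // flush to HSM
        int disk = "ONLINE".equals(accessLatency)
                ? Math.max(1, unitReplicas)  // at least one permanent replica
                : 0;                         // NEARLINE: disk copies are cache-only
        return new Requirements(disk, tape);
    }
}
```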
On the basis of these available states, the transitions listed in table 2 have been implemented.