Conditions evolution of an experiment in mid-life, without the crisis (in ATLAS)

The ATLAS experiment is approaching mid-life: the long shutdown period (LS2) between LHC Runs 1 and 2 (ending in 2018) and the future collision data-taking of Runs 3 and 4 (starting in 2021). In advance of LS2, we have been assessing the future viability of existing computing infrastructure systems. This will permit changes to be implemented in time for Run 3. In systems with broad impact such as the conditions database, making assessments now is critical as the full chain of operations from online data-taking to offline processing can be considered: evaluating capacity at peak times, looking for bottlenecks, identifying areas of high maintenance, and considering where new technology may serve to do more with less. We have been considering changes to the ATLAS Conditions Database related storage and distribution infrastructure based on similar systems of other experiments. We have also examined how new technologies may help and how we might provide more RESTful services to clients. In this presentation, we give an overview of the identified constraints and considerations, and our conclusions for the best way forward: balancing preservation of critical elements of the existing system with the deployment of the new technology in areas where the existing system falls short.


Introduction
The ATLAS experiment [1] is operating at the Large Hadron Collider (LHC) at CERN. ATLAS has been successfully collecting collision data for the past 8 years, and in the next few years there will be significant upgrades of the LHC machine and of ATLAS subdetector systems, bringing an increase in luminosity and in the rate of collected data. The processing of the experimental data collected by ATLAS requires a wide variety of auxiliary information from many systems (e.g. Detector Control Systems (DCS), Trigger and DAQ, Data Quality, the LHC accelerator and ATLAS sub-detectors) stored in the ATLAS Conditions Database. Such pieces of information are heterogeneous both in data type (e.g. standard integer, floating point and string types, database-standard BLOBs, CLOBs, and types special to the Conditions database infrastructure like references to external POOL and ROOT files) and in time granularity (spanning intervals from minutes to hours in duration).
During LHC Run 1 and Run 2, the ATLAS Collaboration deployed and used a system based on the LHC Computing Grid (LCG) Conditions Database infrastructure and the COOL API [2], a C++ library built on a software layer called CORAL. CORAL manages the actual database access and the queries to be issued (hiding the SQL complexity from clients) and supports multiple relational platforms (SQLite and Oracle). This architecture has worked well so far, but it is showing signs that it will not scale to cope with the increasingly complex and intense data flows that come with the growing LHC luminosity.
For this reason, the ATLAS Conditions Database management is evolving to a new Conditions Database system [3] which is based on RESTful client-server interaction. Its architecture uses an intermediate server which separates the business components dealing with database management from the client. This allows the client and server to evolve separately via well-identified interfaces. In the new framework, the database access layer is implemented at server level and the exchanged network traffic will be conditions-database specific (instead of the present generic SQL). This makes better use of the parameters used to retrieve the data, which today are completely invisible inside the SQL statement; this element can be relevant for caching optimization. During the intermediate transition phase, we are developing a set of new tools to permit RESTful access to COOL/CORAL, preserving the functionality of the present system.

The ATLAS Conditions Database during LHC Run 1 and Run 2
ATLAS conditions data are stored in a relational database. The database design is based on the LCG Conditions Database and is accessed by clients using the COOL API, both of which were developed by the CERN IT department for the LCG [2]. COOL is a C++ API library based on the CORAL access layer. It provides high-level functionality which allows users (the expert scientists who manage the conditions data for any ATLAS sub-detector) to create their own COOL Database and fill it with payload data corresponding to a given time range over which that data is valid (the IOV, or Interval of Validity). In COOL terminology, a COOL Database for a given system is called a schema, and the database tables dedicated to a given set of parameters inside each schema are called folders. A folder containing data which can only have a fixed, unchangeable value in each time interval is defined as a single-version (SV) folder: IOVs may only be appended and data cannot be overwritten. A folder is defined as a multi-version (MV) folder when the payload data can have multiple versions in any given IOV: each new version of the data is entered under a distinct folder tag. For data processing involving many folders, a higher-level global tag is defined which contains all the folder tags to be used: this simplifies task configuration because event processing generally requires the conditions data from over 150 different folders.
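The relationship between IOVs, single-version and multi-version folders, folder tags and global tags can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class and function names (`IovStore`, `MultiVersionFolder`, `resolve`), the folder name and the tag names are hypothetical and are not part of the COOL API, and IOVs are treated as half-open intervals [since, until) purely for simplicity.

```python
from bisect import bisect_right


class IovStore:
    """Toy append-only sequence of IOVs; models one single-version folder."""

    def __init__(self):
        self._since = []    # IOV start times, kept in increasing order
        self._until = []    # matching IOV end times (exclusive)
        self._payload = []  # payload valid over [since, until)

    def append(self, since, until, payload):
        if since >= until:
            raise ValueError("IOV must satisfy since < until")
        # Append-only rule: a new IOV may not reach back into stored ones.
        if self._until and since < self._until[-1]:
            raise ValueError("existing IOVs cannot be overwritten")
        self._since.append(since)
        self._until.append(until)
        self._payload.append(payload)

    def lookup(self, t):
        """Return the payload whose IOV contains time t, or None."""
        i = bisect_right(self._since, t) - 1
        if i >= 0 and t < self._until[i]:
            return self._payload[i]
        return None


class MultiVersionFolder:
    """Toy multi-version folder: one IOV sequence per folder tag."""

    def __init__(self):
        self._by_tag = {}

    def store(self, tag, since, until, payload):
        self._by_tag.setdefault(tag, IovStore()).append(since, until, payload)

    def lookup(self, tag, t):
        store = self._by_tag.get(tag)
        return store.lookup(t) if store is not None else None


def resolve(global_tag, folders, t):
    """Use a global tag (folder name -> folder tag) to fetch all payloads at t."""
    return {name: folders[name].lookup(tag, t) for name, tag in global_tag.items()}


# Hypothetical usage: two versions of the same alignment data.
align = MultiVersionFolder()
align.store("align-v1", 0, 200, "first-pass")
align.store("align-v2", 0, 200, "reprocessed")

global_tag = {"/ALIGN": "align-v2"}
payloads = resolve(global_tag, {"/ALIGN": align}, 150)
```

In this sketch a processing job only needs the global tag and a time: the global tag selects one folder tag per folder, and the folder tag selects which version of the payload is returned for the requested IOV.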
Conditions data are concurrently accessed by a large number of clients: the majority are event processing jobs using Athena [4], the ATLAS event processing software framework. Database access is managed via the COOL API using the intermediate Frontier/Squid [5] services, as shown in Figure 1.
Database queries created via the COOL/CORAL stack are propagated as parameters inside a URL and sent as HTTP requests to the Frontier server. Access to the Oracle Database is managed via a Java Database Connectivity (JDBC) layer. Since each request with an identical URL will retrieve the same data from Oracle, a simple web caching layer is implemented via a Squid proxy.
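The caching principle described here, namely that identical URLs can be answered from a cache without reaching the database, can be sketched as follows. This is a minimal illustration, not the Frontier or Squid implementation: the host name, the parameter names and the helpers `build_url` and `UrlCache` are assumptions made for the example, and the real Frontier URL encoding of queries differs from the plain query string used here.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Encode request parameters into a URL (hypothetical format).

    Parameters are sorted so that logically identical requests always
    produce the same URL, and therefore the same cache key.
    """
    return f"{base}?{urlencode(sorted(params.items()))}"


class UrlCache:
    """Toy stand-in for a web cache: responses are keyed by the full URL."""

    def __init__(self, fetch):
        self._fetch = fetch  # called only on a cache miss
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[url] = self._fetch(url)
        return self._store[url]


backend_calls = []


def fake_backend(url):
    """Stand-in for the server that would query the database."""
    backend_calls.append(url)
    return f"payload for {url}"


BASE = "http://frontier.example.org/conditions"  # hypothetical host
cache = UrlCache(fake_backend)

url_a = build_url(BASE, {"folder": "/ALIGN", "tag": "align-v2", "time": 150})
url_b = build_url(BASE, {"time": 150, "tag": "align-v2", "folder": "/ALIGN"})
url_c = build_url(BASE, {"folder": "/ALIGN", "tag": "align-v2", "time": 151})

cache.get(url_a)  # miss: goes to the backend
cache.get(url_b)  # hit: same URL as url_a
cache.get(url_c)  # miss: a different time value gives a different URL
```

In this sketch any difference in a request parameter yields a distinct URL and hence a separate cache entry, so the achievable hit rate depends directly on how the parameters of client requests are chosen.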
The database design and the data access model have been generally stable through LHC Run 1 and for most of Run 2, but in the later stages of Run 2 dips in data access efficiency were observed. This led to a general survey of the system to assess how to improve current operations as well as the viability of the current infrastructure for future operations. We found that the current system is problematic in a number of areas:
• Global management of the system is made more difficult because the conditions data are contained in over 30 schemas and stored in over 10,000 underlying folder tables in the database. The implementation of global tags in the LCG infrastructure proved cumbersome in ATLAS because of the way in which it is implemented at the database level.
• The granularity of the conditions data in the current system is not well suited to the caching mechanisms of Frontier.
• The support from CERN IT of the underlying software stack (COOL and CORAL) is decreasing and will stop at the beginning of Run 3. The effort needed to maintain the complex current implementation will grow as new data are added to existing schemas and as new detector systems come online. We are also concerned about the possible implications for data preservation.
These considerations have led us to evaluate new database models and architectures on the time scale of the end of Run 3.

The new Conditions REST infrastructure
A new REST-based architecture for the management of ATLAS conditions data is now being developed. The new Conditions REST (CREST) architecture enhances the role of the Frontier servers, as shown in Figure 2. CREST development started in collaboration with the CMS experiment [6], which has been successfully using this underlying database model for storing and accessing conditions data since the beginning of LHC Run 2. In this architecture, the client is not aware of the underlying persistence technology used, and interacts with the storage via functions implemented at server level, providing only the set of conditions data or metadata that are needed directly inside the URL (or the request body or headers) of the HTTP method (GET, POST, ...) used. Having the client compose requests via an abstraction layer above the SQL also allows alternative storage systems to be swapped in on the server side in the future without any changes on the client side. We call this set of functions the CREST API; the API has been written using OpenAPI [7] specifications (in short, a JSON file describes the URLs and their parameters as well as the Request and Response content). This enables us to take advantage of code generation tools: in our case, the client library (in