The eXtreme-DataCloud project: solutions for data management services in distributed e-infrastructures

The eXtreme DataCloud (XDC) project aims to develop data management services capable of coping with very large data resources, allowing future e-infrastructures to address the needs of the next generation of extreme-scale scientific experiments. Started in November 2017, XDC combines the expertise of 8 large European research organisations. The project develops scalable technologies for federating storage resources and managing data in highly distributed computing environments. It is use-case driven and multidisciplinary, addressing requirements from research communities in a wide range of scientific domains: Life Science, Biodiversity, Clinical Research, Astrophysics, High Energy Physics and Photon Science, which are representative of data management needs in Europe and worldwide. The use cases proposed by the different user communities are addressed by integrating different data management services able to manage an increasing volume of data. Scalability and performance tests have been defined to show that the XDC services can be harmonized in different contexts and in complex frameworks like the European Open Science Cloud. The use cases have been used to measure the success of the project and to show that the developments fulfil the defined needs and satisfy the final users. The present contribution describes the results obtained from the adoption of the XDC solutions and provides a complete overview of the project achievements.

© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/). EPJ Web of Conferences 245, 04010 (2020) CHEP 2019 https://doi.org/10.1051/epjconf/202024504010


Introduction
The eXtreme DataCloud (XDC) project [1] develops scalable technologies for federating storage resources and managing data in highly distributed computing environments. The project is run by a Consortium that brings together technology providers with proven, long-standing experience in software development and large research communities belonging to diverse disciplines: Life Science, Biodiversity, Clinical Research, Astrophysics, High Energy Physics and Photon Science. The project will end in April 2020 after 30 months of activity. It was funded under the European Union H2020 framework programme (Call EINFRA-  Research and Innovation action) with a total EU contribution of 3.07 million Euros. XDC developments enrich existing data management services in e-infrastructures, adding new functionalities requested by the research communities represented in the Consortium and targeting the computing infrastructures built by the European Open Science Cloud (EOSC) [2], the EGI Foundation (EGI) [3] and the Worldwide LHC Computing Grid (WLCG). An introduction to the project, the description of its objectives, the research communities involved, the services composing its software catalogue and the foreseen technical architecture can be found in [4] [5]. The next section provides a brief project overview, while Section 3 reports the main development achievements of the project.

The XDC project overview
XDC is a follow-up, in the field of data management, of the INDIGO-DataCloud project [6] and continues the INDIGO work on the definition and handling of storage Quality of Service (QoS) and on data lifecycle management. It develops smart orchestration tools to realize policy-driven data management in heterogeneous e-infrastructures. XDC develops the building blocks of the software layer that can be used to implement so-called data lakes, intended as storage federations providing transparent or quasi-transparent access to data stored in very large, geographically distributed sites connected through a high-bandwidth network backbone. XDC is a user-driven project, and Table 1 lists the communities represented in its Consortium.

Requires the integration of different tools to manage the data lifecycle, the production of data based on FAIR (+R, Reproducibility) principles, and the automated storage, in distributed environments, of data and metadata produced by sensors of different sources (satellites, simulation models, air/water quality monitoring, etc.).

CTA - The Cherenkov Telescope Array [11]
Astrophysics and Astroparticle physics. French National Centre for Scientific Research (CNRS). Requires the creation of a distributed storage infrastructure to store, with multiple QoS, data coming from the two CTA antennas. Metadata extraction and manipulation are also needed during the process.

The XDC "toolbox" used to address the requirements provided by the User Communities is a set of services already available as well-known production-quality tools or as prototypes at TRL6 or higher (Technology Readiness Level [12]). All XDC products are released under open source licenses at TRL8 or higher. Table 2 summarizes the tools used in the XDC architecture and developed with the contribution of XDC, whose developments aim at improving the functionality, performance, usability and scalability of all the listed components.

Project Achievements
After an initial phase of requirement gathering, which involved all the scientific communities represented in XDC, the project produced two major releases: XDC-1, codenamed Pulsar, released in January 2019, and XDC-2, codenamed Quasar, released in March 2020. Both releases are based on the general architecture presented in [4] and improve, or add missing functionalities to, the software components reported in Table 2. They address important topics such as federation of storage resources, smart caching solutions, policy-driven data management based on Quality of Service, data lifecycle management, metadata handling and manipulation, and optimized data management based on storage events. The two XDC releases allow a tighter integration between the participating components, enabling workflows that exploit the whole data platform as a coherent infrastructure rather than as a set of disparate services.

Table 3

dCache
Storage event support with Kafka and Server-Sent Events (SSE); support for inotify events; new plugin for SSE; clients can discover changes in the dCache namespace using an interface modelled after the inotify API; updated dCache View; more robust and scalable third-party copy functionality.

Dynafed
OIDC support, both as a Relying Party (redirecting a browser to an IdP) and as a Protected Resource (consuming OAuth access tokens for non-interactive access). This facilitates the integration with the XDC Orchestrator and allows browser-based access without X.509 certificates. It is configured through Apache's mod_auth_openidc. "Fourth party copy": Dynafed can function as the active party for data distribution. This enables services without third-party copy support (such as S3) to participate fully in the data distribution infrastructure.

Rucio
Authentication and authorization mechanism extended to support JSON Web Tokens (JWT) using the OIDC protocol, to allow permission control downstream (Rucio → FTS3). Implemented an internal mechanism using the token exchange and token refresh grants. Added user pre-provisioning (via a new Rucio SCIM client), implemented as a 'Rucio probe' script. Implemented a new Rucio daemon that takes care of token deletion, token refresh and clean-up of expired OIDC authentication sessions. Third-party copy is supported along the entire chain Rucio → FTS3 → dCache.

Caching-on-demand
Provides recipes and PaaS description templates for an end-to-end deployment of an XCache cluster. Container based, with support for Docker and Kubernetes, and Ansible recipes.

PaaS Orchestrator
Added OpenID Connect support in the interaction with Onedata and Rucio, added the description of Dynafed resources in TOSCA templates, added a scheduling strategy using Dynafed, implemented Cloud Providers retry logic, improved the description of Onedata spaces, and added the possibility to steer data movement using Rucio. Improved the web interface used to submit and monitor orchestration requests: the "PaaS Orchestrator Dashboard".

TOSCA Types and Templates
Extended TOSCA Simple Profile in YAML Version 1.0 to add high level entities. Added new types for Onedata and Dynafed storage resources, added new types for describing a Kubernetes cluster. New Ansible roles implemented, new type for describing a JupyterHub node, updated example templates for deploying Chronos dockerized jobs that use Onedata for managing input/output data.
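
The storage-event support listed for dCache above can be illustrated from the client side. The following is a minimal sketch, not dCache's actual client code: it implements a simplified parser for the generic Server-Sent Events text format and filters for files that have finished being written. The event name "inotify", the JSON payload layout and the file name are assumptions made for the example; only the inotify flag names (IN_CREATE, IN_CLOSE_WRITE) come from the Linux inotify API that the dCache interface is modelled after.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class ServerSentEvent:
    event: str = "message"
    data: str = ""
    id: Optional[str] = None


def parse_sse(lines: Iterable[str]) -> Iterator[ServerSentEvent]:
    """Yield events from SSE text lines (simplified: no 'retry' handling)."""
    event_type = "message"
    data_parts: List[str] = []
    last_id: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates an event; dispatch it if it has data.
            if data_parts:
                yield ServerSentEvent(event_type, "\n".join(data_parts), last_id)
            event_type, data_parts = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, typically used as keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event_type = value
        elif field == "data":
            data_parts.append(value)
        elif field == "id":
            last_id = value


def files_ready(events: Iterable[ServerSentEvent]) -> List[str]:
    """Return names of files whose write has completed (hypothetical payload)."""
    names = []
    for ev in events:
        if ev.event != "inotify":
            continue
        payload = json.loads(ev.data)
        if "IN_CLOSE_WRITE" in payload["mask"]:
            names.append(payload["name"])
    return names


# Hypothetical stream content, standing in for lines read from an HTTP response.
SAMPLE_STREAM = [
    ": keep-alive",
    "event: inotify",
    "id: 1",
    'data: {"mask": ["IN_CREATE"], "name": "run-0042.dat"}',
    "",
    "event: inotify",
    "id: 2",
    'data: {"mask": ["IN_CLOSE_WRITE"], "name": "run-0042.dat"}',
    "",
]

if __name__ == "__main__":
    print(files_ready(parse_sse(SAMPLE_STREAM)))
```

Reacting to a close-after-write event rather than to the creation event is the usual way to trigger downstream processing only once a file is complete, which is the kind of event-driven data management the table refers to.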
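
The token exchange and token refresh grants mentioned for Rucio above are standard OAuth 2.0 mechanisms (RFC 8693 and RFC 6749 respectively). The sketch below only shows how the form-encoded bodies of such requests could be built; it is not Rucio's implementation. The audience value, the scope names and the token strings are placeholders, and client authentication and the HTTP call to the identity provider's token endpoint are deliberately left out.

```python
from typing import Iterable
from urllib.parse import parse_qs, urlencode

# Identifiers defined by RFC 8693 (OAuth 2.0 Token Exchange).
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def build_token_exchange_body(subject_token: str, audience: str,
                              scopes: Iterable[str]) -> str:
    """Form body asking the IdP for a token valid for a downstream service."""
    return urlencode({
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": subject_token,
        "subject_token_type": ACCESS_TOKEN_TYPE,
        "audience": audience,          # the downstream service, e.g. FTS3
        "scope": " ".join(scopes),     # space-separated, as in RFC 6749
    })


def build_refresh_body(refresh_token: str) -> str:
    """Form body for the refresh token grant (RFC 6749, section 6)."""
    return urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
    })


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    body = build_token_exchange_body(
        subject_token="user-access-token",
        audience="fts3.example.org",
        scopes=["openid", "offline_access"],
    )
    print(parse_qs(body)["grant_type"][0])
    print(build_refresh_body("stored-refresh-token"))
```

In a chain such as Rucio → FTS3, the first service would exchange the user's token for one scoped to the next service, and a background process would use the refresh grant to keep long-running transfers authorized.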

Conclusion
XDC focused on the continuous interoperability and the highest scalability of a well-established toolbox to build a Europe-wide data storage and management system for science. Successful efforts have been undertaken to involve community and industry standardization organizations (e.g. RDA) as well as large, well-established, data-driven communities to guarantee the sustainability of the XDC achievements. Scalability is pursued not only in the individual components but also in their interactions. The XDC architecture has therefore been implemented so that its components are interoperable, with regard both to functionality and to modern, fine-grained authentication methods, allowing a distributed model to be fully exploited and horizontal scaling to be achieved at the infrastructure level. XDC is active in all the forums and boards where the represented communities discuss the requirements and the implementation of the e-infrastructures they use, presenting the solutions developed in the last 30 months. Moreover, to guarantee sustainability, all new features have been pushed to their upstream repositories, leading to wide deployment as soon as the new versions are downloaded and installed from those sources. XDC is also interacting with the projects and entities that are shaping the European Open Science Cloud (EOSC), promoting the developed solutions as building blocks of the data management systems of this emerging infrastructure.