IRIS – providing a nationally accessible infrastructure for UK science

In many countries around the world, national infrastructures for science have either been implemented or are under serious consideration by governments and funding bodies. Current examples include ARDC in Australia, CANARIE in Canada and MTA Cloud in Hungary. These infrastructures provide access to compute and storage for a wide swathe of user communities and represent a collaboration between users, providers and, in some cases, industry to maximise the impact of the investments made. The UK has embarked on a project called IRIS to develop a sustainable e-infrastructure based on the needs of a diverse set of communities. Building on the success of the UK component of the WLCG and the innovations made there, a number of research institutes and universities are working with several research groups to co-design an infrastructure, including support services, which takes this to a level applicable to a wider user base. We present the preparatory work leading to the definition of this infrastructure, showing the wide variety of use cases which need to be supported. This leads us to a definition of the hardware and interface requirements needed to meet this diverse set of criteria, and of the support posts identified in order to make best use of this facility and sustain it into the future.


Introduction
The Science and Technology Facilities Council (STFC) [1] is responsible for funding research in UK universities in the fields of particle physics, astronomy, nuclear physics and space science, and operates large-scale national scientific research facilities. Computing and data handling are essential for the world-leading science supported by STFC, and in the coming years most STFC programme areas expect a significant increase in data volumes and required computing resources. Historically, each activity or small group of activities has independently deployed and managed its own e-infrastructures.
In 2014-2015 STFC conducted a strategic review of its current computing capability and future requirements of its scientific and industrial user communities [2]. This set out a vision for a strategic and coordinated approach for computing and storage. Amongst the recommendations was the goal that the different science disciplines should find ways of sharing computing infrastructure and that new projects should be encouraged to make use of existing expertise and infrastructure. Clearly this approach would help to ensure resources, both in terms of hardware and staff effort, are used more efficiently.
In this paper we describe e-Infrastructure for Research and Innovation for STFC (IRIS) [3], an initiative to bring STFC computing interests and the needs of research communities together.

IRIS
IRIS is based on a bottom-up approach, formed by communities working together. Rather than being an e-infrastructure project which runs its own infrastructure and allocates these resources to user communities, it is a self-formed coordination body consisting of both users (referred to as Science Activities) and resource providers (referred to as Provider Entities).
A non-exhaustive list of science activities participating in IRIS includes the Diamond Light Source [4], the ISIS Neutron and Muon Source [5], the Central Laser Facility [6], the Large Hadron Collider and its experiments [7], the Square Kilometre Array [8], the Rubin Observatory [9], the Euclid satellite [10], the LIGO gravitational wave detector [11] and the Culham Centre for Fusion Energy [12]. Provider entities, who run e-infrastructure, include the STFC Scientific Computing Department [13], DiRAC [14] and GridPP [15]. The IRIS governance structure is summarised in Fig. 1. The Delivery Board (DB) has members representing all of the science activity partners and all of the resource provider entities. The Technical Working Group (TWG) coordinates all technical matters that can be undertaken in common and coordinates the definition of the presentation interfaces for IRIS resources. The Resource Scrutiny and Allocation Panel (RSAP) analyses the resource requests from partners and recommends allocations to the DB, which then makes allocations to user activities. The DB is also responsible for making decisions relating to IRIS-funded hardware and allocating funds to the provider entities.

Hardware
IRIS does not run any e-infrastructure directly but was successful in obtaining significant capital funding, most of which is being used for hardware, with a smaller amount for digital assets (described in the next section). IRIS has commissioned the deployment of resources at sites with a track record of successfully supporting STFC, including the STFC Data Centre at RAL augmented by existing GridPP and DiRAC sites.
The following cumulative capacities of physical resources have been deployed for IRIS:
• Q4 2018: 4800 cores, 4 PB disk
• Q2 2019: 8000 cores, 7 PB disk, tape
• Q4 2019: 14000 cores, 14 PB disk, tape
These resources are distributed across a few of the IRIS sites. It is important to note that IRIS has no funded staff and therefore relies on the best-effort support of the provider sites.

Resource allocation
Each year there is a call for proposals for resources. This is modelled on the LHC Resources Review Board, the WLCG Computing Resources Scrutiny Group, and the DiRAC Resource Allocation Committee. It is intended only for previously peer-reviewed and approved science, with the aim being that only resources need to be justified, not science. This is to uphold a principle of IRIS that science activities are not subject to "double jeopardy".
Applications are initially received by the Technical Scrutiny Team, and an iterative formative process takes place up to the final submission deadline. This formative step is a critical part of the process: it ensures that applications are well aligned with the resources that IRIS can provide and that all relevant information has been supplied in time for the panel to conduct its review of the applications. The final submissions are then reviewed and discussed by the RSAP. The preliminary allocations report from the RSAP is provided to the DB, which considers and agrees the final allocations.

Digital assets
IRIS supports the creation of digital assets, which are pieces of software enabling users to exploit data more quickly and efficiently. Most of these funds go to the Ada Lovelace Centre (ALC) for creating digital assets for users of the UK's national facilities. A small portion of the funds is made available to partners in other domains. In this case, there is an annual call for light-weight proposals. Each project will typically create or enhance a piece of software that can be clearly evidenced upon completion and is expected to be useful to several communities or to IRIS as a whole.
One example of a digital asset funded through the ALC is Data Analysis as a Service (DAaaS), which aims to improve the scientific output of the ISIS Neutron and Muon Source, Central Laser Facility and Diamond Light Source. DAaaS provides users with a remote desktop offering a customised environment for the specific type of analysis required. All required software is pre-installed and data can be easily accessed, allowing scientists to analyse the data obtained in their experiments.
Examples of other funded digital assets include deploying an instance of the INDIGO Identity and Access Management (IAM) [16] service to be used as the IRIS identity proxy, extending the Rucio data management system [17] to support multiple VOs, and developing an OpenStack reference platform optimised for scientific computing.

Federation
It is a significant challenge to integrate resources based on different technologies distributed across multiple sites, including high-throughput computing (HTC) clusters, high-performance computing (HPC) clusters, clouds, large scale data stores and large scale databases. It is also necessary to support both large international collaborations and very small groups of researchers. This has to be done without any dedicated IRIS support, as IRIS is not able to fund staff effort.
For the communities to work together in a federation there need to be some common elements in order to operate an integrated distributed infrastructure. These are outlined below.
A policy and trust framework. This is the requirement to establish the necessary policies to allow interoperation between resource providers, services and user groups, taking into account that the resource providers have existing policy frameworks and some providers are already connected within federations (for example the WLCG [18]).
Identity management. Having consistent authentication and authorisation across all resources is important, as having to use different credentials for each resource would be a significant impediment. Also, users expect to be able to use their credentials from their own institutional identity providers, while at the same time it should be possible to provide identities for users who do not have an identity provider.
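Identity proxies such as the INDIGO IAM instance mentioned above typically issue OAuth2/OIDC bearer tokens (JWTs) that carry a user's identity and group membership. As a minimal sketch of how a service might read the claims of such a token, the snippet below builds an illustrative unsigned token and decodes its payload; the claim names are assumptions for illustration, and a real relying party must verify the token's signature against the issuer's published keys before trusting any claim.

```python
import base64
import json

def b64url_decode(data: str) -> bytes:
    """Decode base64url data, restoring any stripped padding."""
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))

def token_claims(token: str) -> dict:
    """Extract the claims from a JWT payload WITHOUT verifying the
    signature; production code must verify against the issuer's keys."""
    _header_b64, payload_b64, _signature = token.split(".")
    return json.loads(b64url_decode(payload_b64))

# Illustrative token: header and payload are base64url-encoded JSON.
# The issuer URL and claim names are invented for this sketch.
header = base64.urlsafe_b64encode(
    json.dumps({"alg": "none"}).encode()).decode().rstrip("=")
payload = base64.urlsafe_b64encode(json.dumps(
    {"sub": "alice", "iss": "https://iam.example.org", "groups": ["lsst"]}
).encode()).decode().rstrip("=")
token = f"{header}.{payload}."

claims = token_claims(token)
print(claims["sub"], claims["groups"])
```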
Information system. This records definitive information about all resources at sites and services in the federation layer. It is used for discovery of services and for declaring downtimes of services in a way which can be used programmatically. An example of such a service is the GOCDB [19] which is used by EGI [20] and the WLCG.
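Services like the GOCDB publish resource and downtime records over an XML API, which clients consume programmatically. The sketch below parses a downtime listing of the general shape such an API returns; the sample XML and element names are illustrative approximations, not a verbatim GOCDB response.

```python
import xml.etree.ElementTree as ET

# Illustrative downtime listing, loosely modelled on a GOCDB-style
# response; the element names are assumptions for this sketch.
SAMPLE = """
<results>
  <DOWNTIME>
    <SITENAME>RAL-LCG2</SITENAME>
    <SEVERITY>OUTAGE</SEVERITY>
    <DESCRIPTION>Storage maintenance</DESCRIPTION>
  </DOWNTIME>
  <DOWNTIME>
    <SITENAME>UKI-NORTHGRID-MAN-HEP</SITENAME>
    <SEVERITY>WARNING</SEVERITY>
    <DESCRIPTION>Network at risk</DESCRIPTION>
  </DOWNTIME>
</results>
"""

def parse_downtimes(xml_text: str) -> list:
    """Turn a downtime listing into a list of dicts so downtimes can be
    consumed programmatically, e.g. to drain a site before an outage."""
    root = ET.fromstring(xml_text)
    return [{child.tag.lower(): child.text for child in dt}
            for dt in root.findall("DOWNTIME")]

for dt in parse_downtimes(SAMPLE):
    print(dt["sitename"], dt["severity"])
```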
Accounting system. An accounting system is necessary to record the usage of computing and storage resources, which is necessary to understand the contributions of the participating sites and the pattern of use by different user communities. APEL [21], the accounting service used by EGI and the WLCG, has been adopted by IRIS for accounting.
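The per-community usage summaries an accounting system produces amount to an aggregation over individual job records. The sketch below illustrates the idea with invented records and field names; it is not APEL's actual record schema.

```python
from collections import defaultdict

# Illustrative job records as (community, cpu_hours) pairs; the
# communities and numbers are invented, not real IRIS accounting data.
records = [
    ("lsst", 1200.0),
    ("euclid", 800.0),
    ("lsst", 300.0),
    ("diamond", 450.0),
]

def summarise(job_records):
    """Total CPU hours per user community, the kind of summary used to
    understand the pattern of use by different communities."""
    totals = defaultdict(float)
    for community, cpu_hours in job_records:
        totals[community] += cpu_hours
    return dict(totals)

print(summarise(records))
```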
Monitoring. It is important to automatically test the status of components in the infrastructure, with the aim of detecting problems before users do. This can include both automated checks and metric-based reports.
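A metric-based check of this sort can be reduced to comparing a measured value against warning and critical thresholds, in the style of common infrastructure monitoring probes. The service names, metric and thresholds below are invented for illustration.

```python
def check_status(metric_value: float, warn: float, crit: float) -> str:
    """Classify a measured metric against warning/critical thresholds."""
    if metric_value >= crit:
        return "CRITICAL"
    if metric_value >= warn:
        return "WARNING"
    return "OK"

# Hypothetical metric: disk usage fraction per storage service.
metrics = {"disk-ral": 0.97, "disk-man": 0.85, "disk-lancs": 0.40}
report = {svc: check_status(v, warn=0.80, crit=0.95)
          for svc, v in metrics.items()}
print(report)
```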
Ticketing system. It is helpful to have a common ticketing system for reporting problems which can be used by user communities, sites and operators of the services.
Progress has been made on provision of the above items in IRIS through the digital assets funding.

Interface requirements
With such a diverse range of user communities, it was proposed to agree on baselines which user communities and resource providers can use when deciding on which interfaces to use and expose for both compute and storage. This has to be guided by the dual goals of enhancing the existing production infrastructure available to STFC science projects, and progressing to an enhanced, unified infrastructure which is better aligned with the requirements of the projects.
An evolutionary approach was proposed which allows IRIS to have an immediate impact on what is available to projects which require production resources. It is anticipated that the list of interfaces will evolve over time in response to changing user and site requirements. The baseline interfaces were agreed after feedback from user communities and discussion within the Technical Working Group.
To comply with the baseline for compute access, it was agreed that a site must provide either (a) the standard OpenStack APIs (including Keystone, Nova and Glance) [22] for creating VMs, (b) a Grid computing element with batch system backend for Grid jobs (for example the ARC CE [23]) or (c) a Vacuum Cloud [24] for running VM or container definitions provided by the user communities.
For storage access, a site must provide at least one of the following: S3, WebDAV, a shared POSIX filesystem, or a Grid interface (SRM, GridFTP or xrootd). There is an additional constraint that sites providing only OpenStack should provide S3 storage, while sites providing only a Grid computing element should provide storage with a Grid interface.
Figure 2 shows the usage of IRIS computing resources, obtained from the APEL accounting system, in comparison to the deployed resources. Usage has been increasing over the past year, with the largest users in recent months being the Rubin Observatory (LSST), Euclid, Diamond and ISIS. Note that opportunistic usage of GridPP resources means the usage shown can sometimes exceed the total IRIS deployed capacity.
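The compute and storage baseline rules above, including the coupling constraints for OpenStack-only and Grid-CE-only sites, can be sketched as a simple compliance check. The interface labels below are informal shorthand introduced for this sketch, not official IRIS vocabulary.

```python
# Informal shorthand for the baseline interface options described above.
COMPUTE_OPTIONS = {"openstack", "grid-ce", "vacuum"}
STORAGE_OPTIONS = {"s3", "webdav", "posix", "grid-storage"}

def meets_baseline(compute: set, storage: set) -> bool:
    """True if a site's declared interfaces satisfy the IRIS baselines:
    at least one compute and one storage option, with OpenStack-only
    sites required to offer S3 and Grid-CE-only sites a Grid storage
    interface."""
    c = compute & COMPUTE_OPTIONS
    s = storage & STORAGE_OPTIONS
    if not c or not s:
        return False
    if c == {"openstack"} and "s3" not in s:
        return False
    if c == {"grid-ce"} and "grid-storage" not in s:
        return False
    return True

print(meets_baseline({"openstack"}, {"s3"}))          # compliant
print(meets_baseline({"openstack"}, {"webdav"}))      # needs S3
print(meets_baseline({"grid-ce"}, {"grid-storage"}))  # compliant
```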

Conclusion
A group of resource providers and user communities in the fields of particle physics, astronomy, nuclear physics and space science has taken the initiative to form a cooperative project in order to develop and grow a distributed computing and storage research infrastructure. Already this project is proving successful in fostering collaboration between communities and in the sharing of computing infrastructures. It is expected to have increasing importance as the resource requirements of compute- and data-intensive scientific research grow in scale in the coming years.