Issue | EPJ Web Conf., Volume 245, 2020: 24th International Conference on Computing in High Energy and Nuclear Physics (CHEP 2019)
---|---
Article Number | 09011
Number of page(s) | 6
Section | 9 - Exascale Science
DOI | https://doi.org/10.1051/epjconf/202024509011
Published online | 16 November 2020
Large-scale HPC deployment of Scalable CyberInfrastructure for Artificial Intelligence and Likelihood Free Inference (SCAILFIN)
1 Department of Physics, University of Notre Dame, Notre Dame, IN, USA
2 Center for Research Computing, University of Notre Dame, Notre Dame, IN, USA
3 IT/CDA Division, CERN, 1211 Meyrin, Switzerland
* Corresponding author: mhildret@nd.edu
The NSF-funded Scalable CyberInfrastructure for Artificial Intelligence and Likelihood Free Inference (SCAILFIN) project aims to develop and deploy artificial intelligence (AI) and likelihood-free inference (LFI) techniques and software using scalable cyberinfrastructure (CI) built on top of existing CI elements. Specifically, the project has extended the CERN-based REANA framework, a cloud-based data analysis platform deployed on top of Kubernetes clusters that was originally designed to enable analysis reusability and reproducibility. REANA is capable of orchestrating extremely complicated multi-step workflows, and uses Kubernetes clusters both for scheduling and distributing container-based workloads across a cluster of available machines and for instantiating and monitoring the concrete workloads themselves. This work describes the challenges and development efforts involved in extending REANA, and the components that were developed to enable large-scale deployment on High Performance Computing (HPC) resources. Using the Virtual Clusters for Community Computation (VC3) infrastructure as a starting point, we adapted REANA to work with a number of different workload managers, covering both high-performance and high-throughput systems, while simultaneously removing REANA’s dependence on Kubernetes support at the worker level.
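As an illustration of what supporting several workload managers without Kubernetes on the workers can look like, the sketch below renders one containerized workflow step as either a Slurm batch script or an HTCondor submit description. It is not REANA's actual interface: the names `JobSpec`, `Backend`, `SlurmBackend`, `HTCondorBackend`, and `render_for` are hypothetical. The choice of Slurm and HTCondor as example managers, the use of Singularity as the user-space container runtime, and the `<name>.sh` wrapper script are assumptions made only for this example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
import shlex


@dataclass(frozen=True)
class JobSpec:
    """One workflow step: a container image plus the command to run in it."""
    name: str
    image: str
    command: str


def container_command(job: JobSpec) -> str:
    # Assumption: workers provide a user-space container runtime
    # (Singularity here) instead of a Kubernetes node agent.
    return f"singularity exec docker://{job.image} sh -c {shlex.quote(job.command)}"


class Backend(ABC):
    """Hypothetical adapter: turns a JobSpec into text a workload manager accepts."""

    @abstractmethod
    def render(self, job: JobSpec) -> str:
        ...


class SlurmBackend(Backend):
    def render(self, job: JobSpec) -> str:
        # A batch script suitable for passing to `sbatch`.
        lines = [
            "#!/bin/bash",
            f"#SBATCH --job-name={job.name}",
            container_command(job),
        ]
        return "\n".join(lines) + "\n"


class HTCondorBackend(Backend):
    def render(self, job: JobSpec) -> str:
        # A submit description suitable for passing to `condor_submit`.
        # Assumption: a wrapper script named <name>.sh holds the container command.
        lines = [
            "universe = vanilla",
            f"executable = {job.name}.sh",
            f"output = {job.name}.out",
            f"error = {job.name}.err",
            "queue",
        ]
        return "\n".join(lines) + "\n"


BACKENDS = {"slurm": SlurmBackend(), "htcondor": HTCondorBackend()}


def render_for(manager: str, job: JobSpec) -> str:
    """Select the adapter by workload-manager name and render the job."""
    return BACKENDS[manager].render(job)


if __name__ == "__main__":
    step = JobSpec(name="fit", image="python:3.11", command="python fit.py")
    print(render_for("slurm", step))
    print(render_for("htcondor", step))
```

Keeping the workflow step description independent of the scheduler means that the orchestration layer can stay unchanged while only the adapter differs per site.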
© The Authors, published by EDP Sciences, 2020
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.