Large-scale HPC deployment of Scalable CyberInfrastructure for Artificial Intelligence and Likelihood Free Inference (SCAILFIN)

Motivation
In parallel, there is interest in leveraging Machine Learning (ML) and Artificial Intelligence (AI) techniques to enhance the analysis of data from these facilities.
Of particular interest is their application with emergent Likelihood-Free Inference (LFI) techniques, where the predictions for the data are implicitly defined by the simulation, often leading to an intractable likelihood function. This can apply to the analysis of data from LHC, LIGO, etc., but such likelihood-free algorithms have so far been implemented mostly on individual machines and in ad hoc scripts because the training workflows are very complicated.

SCAILFIN: Scalable CyberInfrastructure for Artificial Intelligence and Likelihood Free Inference
The SCAILFIN project aims to deploy artificial intelligence and likelihood-free inference techniques and software using scalable cyberinfrastructure (CI) that is developed to be integrated into existing CI elements, such as the REANA system, to work on HPC facilities.

Simulation-Based Likelihood Free Inference
Estimation of the optimal estimator lends itself to ML methods:
• Training data are derived from simulations
• Training can be guided by optimal sampling based on the phase space density of the generator and the sensitivity to the physics under study
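To make the idea of learning an estimator from simulated training data concrete, here is a minimal sketch of one common likelihood-free approach, the classifier-based likelihood ratio trick. It is an illustration only and not necessarily the estimator used in the project. The toy Gaussian simulator, the parameter values, and all function names are assumptions introduced for this example.

```python
import math
import random


def simulate(theta, n, rng):
    """Toy simulator (assumption): n draws from a unit-width Gaussian centred at theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def train_classifier(x0, x1, steps=300, lr=0.5):
    """Fit logistic regression s(x) = sigmoid(w*x + b) with plain gradient descent.

    Samples in x1 get label 1, samples in x0 get label 0.
    """
    data = [(x, 0.0) for x in x0] + [(x, 1.0) for x in x1]
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in data:
            s = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (s - y) * x
            grad_b += s - y
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b


def log_ratio(x, w, b):
    """Estimated log p(x|theta1) / p(x|theta0).

    For balanced classes the ideal classifier satisfies s / (1 - s) = p1 / p0,
    and for a logistic model log(s / (1 - s)) is simply w*x + b.
    """
    return w * x + b


# Only simulated samples are used; the likelihood itself is never evaluated.
rng = random.Random(0)
w, b = train_classifier(simulate(0.0, 1000, rng), simulate(1.0, 1000, rng))
```

For this toy problem the exact log ratio is x - 0.5, so the fitted w and b should land close to 1 and -0.5, which gives a quick sanity check on the learned estimator.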
• One of the goals of the project is extending the REANA platform to allow remote submission of workflows to HPC facilities.
• ...
As mentioned before, SCAILFIN plans to use:
• REANA as the cyberinfrastructure element to deploy AI and likelihood-free inference techniques.
• We are also leveraging VC3 (Virtual Clusters for Community Computation) in order to scale REANA to HPC resources.
First, a brief overview of these two components...
• Upload workflow and inputs to REANA cloud
• Download / pull down results
• Share workflow specs with others
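The upload and download steps listed above map onto the reana-client command line. The sketch below only assembles the command lines and does not contact a server. The subcommands and the `-n`, `-f`, `-w` flags follow recent reana-client releases and may differ from the version used here; the `create`, `start`, and `status` steps, the workflow name, and the `reana.yaml` file name are assumptions added for illustration.

```python
from typing import List


def reana_commands(workflow: str, spec_file: str = "reana.yaml") -> List[List[str]]:
    """Return reana-client invocations for one create/upload/run/download cycle.

    The helper and its arguments are hypothetical; only the reana-client
    subcommands themselves come from the REANA command-line client.
    """
    return [
        ["reana-client", "create", "-n", workflow, "-f", spec_file],  # register the workflow
        ["reana-client", "upload", "-w", workflow],                   # upload workflow inputs
        ["reana-client", "start", "-w", workflow],                    # launch the run
        ["reana-client", "status", "-w", workflow],                   # poll progress
        ["reana-client", "download", "-w", workflow],                 # pull down results
    ]


if __name__ == "__main__":
    for cmd in reana_commands("my-analysis"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` on a machine where reana-client is installed and configured for a REANA server.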

Components
• Two major components, each consisting of many sub-components
○ reana-client: User-facing component.
■ Accepts workflows and is used as the interface to the entire REANA system (for the user).
○ reana-cluster: Workhorse.
■ Consists of many small pieces which handle workflows, dish out jobs, and coordinate results; can be thought of as the job scheduler. Jobs are scheduled via Kubernetes.

VC3: Virtual Clusters for Community Computation
VC3: A platform for provisioning cluster frameworks over heterogeneous resources for collaborative science
• REANA requires some form of Docker-supporting container technology
○ Singularity and Shifter support finished.
• REANA expects to submit to a Kubernetes cluster
○ Added support for VC3 specialized HTCondor submissions through a modified reana-job-controller and a job_wrapper for every workflow step.
○ The modified reana-job-controller submits each workflow step to a local condor scheduler
• Job wrapper
○ Auto-detection of container technology for workflow steps (shifter, singularity)
VC3 Modifications:
• Cluster template for REANA+HTCondor
○ Uses the standard HTCondor template as the base to create a condor pool that sends jobs to HPC resources, translating the job to the corresponding batch system submission syntax via bosco.
○ Deploys Kubernetes via minikube
○ Deploys the REANA cluster and client and sets up the environment, so the user can interact with them as soon as the VC3 headnode is created.
• GSI-SSH support
○ The GSI-SSH authentication mechanism was added to the infrastructure in order to support, e.g., XSEDE HPC centers like Blue Waters, Stampede, NERSC.
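As a rough picture of what submitting one workflow step to a local condor scheduler could look like, the sketch below renders an HTCondor submit description in which the job wrapper is the executable. This is not the modified reana-job-controller's actual code: the function name, the `job_wrapper.sh` file name, and the argument layout (image first, then the step command) are assumptions. Only the submit-file keywords are standard HTCondor.

```python
def condor_submit_description(step_name: str, image: str, command: str,
                              wrapper: str = "job_wrapper.sh") -> str:
    """Render a minimal HTCondor submit description for one workflow step.

    Hypothetical helper: the wrapper script receives the container image and
    the step command as whitespace-separated arguments.
    """
    lines = [
        "universe = vanilla",
        f"executable = {wrapper}",
        f"arguments = {image} {command}",
        f"output = {step_name}.out",
        f"error = {step_name}.err",
        f"log = {step_name}.log",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(condor_submit_description("fit-step", "python:3.8", "python fit.py"))
```

The resulting text could be written to a file and handed to `condor_submit`; bosco would then take care of translating the job for the remote batch system.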

Features
• Overlays "cluster" environment on top of diverse resource allocations
• Similar to cloud services that allow you to stand up clusters, but on "your" resources, and in "user space": no root access needed
https://www.virtualclusters.org

VC3 Architecture
• User defines an allocation
• Selects middleware configuration
• VC3 infrastructure creates VC3 headnode and configures resources
• Workers on compute nodes communicate with VC3 headnode to receive compute workloads
• User adds an SSH public key to the system that will be used to grant access to the
• The user can select their own middleware for submission (e.g., HTCondor, WorkQueue, Spark, REANA+HTCondor).
• It doesn't matter what the resource target batch system is (as long as it is supported by glite/blah, the translation layer for submission). E.g.: Torque (Blue Waters), SLURM (NERSC, PSC-Bridges, Stampede2), HTCondor, LSF, SGE, PBS.

Implementation: A REANA Cluster Template for VC3
Many users in a project can have access to the same allocation and submit node.

Using REANA + VC3 on Blue Waters*
Blue Waters cluster:
- Batch system: Torque
- Container technology: Shifter
- Authentication mechanisms:
  - Multi-factor authentication (Password + RSA token)
  - GSI-SSH tokens
Virtual cluster created on top of Blue Waters:
- VC3 submit node with Kubernetes (via minikube) and a REANA cluster deployed on the fly.
- HTCondor as the middleware
- VC3 authenticates with Blue Waters via GSI-SSH
*Note: Infrastructure worked out of the box on other resources such as the ND HPC Cluster and XSEDE/Pittsburgh

Mike Hildreth: mhildret@nd.edu
ND Developers: Kenyi Hurtado: khurtado@nd.edu, Cody Kankel: ckankel@nd.edu
• REANA requires some form of Docker-supporting container technology
○ Singularity and Shifter support in the works. Possibly CharlieCloud
• REANA expects to submit to a Kubernetes cluster
○ Added support for VC3 specialized HTCondor submissions through a modified reana-job-controller and a job_wrapper for every workflow step.
○ The modified reana-job-controller submits each workflow step to a local condor scheduler
• Job wrapper
○ Each workflow step is wrapped by a script which searches for container technology and launches each workflow step into the available container (shifter, singularity)
• Performs check for available container technologies
○ Checks binaries in $PATH (VC3 may auto-load these)
○ Attempts to load modules for Singularity and Shifter
○ Executes workflow step within discovered container
■ Will choose depending on the currently set $default

Job-wrapper container management
Singularity arguments: exec -B ./$REANA_WORKFLOW_DIR:$REANA_WORKFLOW_DIR docker://$DOCKER_IMG
Shifter arguments: --image=docker:${DOCKER_IMG} --volume=$(pwd -P)/reana:/reana --
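The container check and launch logic described above can be sketched as follows. This is a simplified illustration, not the project's job wrapper: it only looks for binaries in `$PATH` and skips the module-loading attempt, the function names are hypothetical, and the Shifter volume mapping is simplified to mount the working directory at the same path inside the container. The `preferred` argument plays the role of the `$default` setting.

```python
import shutil
from typing import Callable, List, Optional

KNOWN_RUNTIMES = ("singularity", "shifter")


def detect_container_runtime(which: Callable[[str], Optional[str]] = shutil.which,
                             preferred: str = "singularity") -> Optional[str]:
    """Return the first available runtime, trying the preferred one first."""
    candidates = [preferred] + [r for r in KNOWN_RUNTIMES if r != preferred]
    for runtime in candidates:
        if runtime in KNOWN_RUNTIMES and which(runtime) is not None:
            return runtime
    return None


def build_step_command(runtime: str, image: str, workdir: str,
                       step_cmd: List[str]) -> List[str]:
    """Wrap a workflow step so that it runs inside the given container runtime."""
    if runtime == "singularity":
        # Bind-mount the workflow directory and pull the image from a Docker registry.
        return ["singularity", "exec", "-B", f"{workdir}:{workdir}",
                f"docker://{image}"] + step_cmd
    if runtime == "shifter":
        return ["shifter", f"--image=docker:{image}",
                f"--volume={workdir}:{workdir}", "--"] + step_cmd
    raise ValueError(f"unsupported container runtime: {runtime}")


if __name__ == "__main__":
    runtime = detect_container_runtime()
    if runtime is None:
        print("no supported container runtime found in PATH")
    else:
        print(build_step_command(runtime, "python:3.8", "/reana", ["python", "fit.py"]))
```

Passing the lookup function as an argument keeps the detection step easy to exercise on machines where neither runtime is installed.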