Distributed Machine Learning Workflow with PanDA and iDDS in LHC ATLAS

Wen Guan; Tadashi Maeno; Rui Zhang; Christian Weber; Torre Wenaus; Aleksandr Alekseev; Fernando Harald Barreiro Megino; Kaushik De; Edward Karavakis; Alexei Klimentov; Tatiana Korchuganova; FaHui Lin; Paul Nilsson; Zhaoyu Yang; Xin Zhao

doi:10.1051/epjconf/202429504019


Issue: EPJ Web of Conf., Volume 295 (2024), 26th International Conference on Computing in High Energy and Nuclear Physics (CHEP 2023)
Article Number: 04019
Number of page(s): 7
Section: Distributed Computing
DOI: https://doi.org/10.1051/epjconf/202429504019
Published online: 06 May 2024

Wen Guan¹^*, Tadashi Maeno¹, Rui Zhang², Christian Weber¹, Torre Wenaus¹, Aleksandr Alekseev³, Fernando Harald Barreiro Megino³, Kaushik De³, Edward Karavakis¹, Alexei Klimentov¹, Tatiana Korchuganova⁴, FaHui Lin³, Paul Nilsson¹, Zhaoyu Yang¹ and Xin Zhao¹

¹ Brookhaven National Laboratory, Upton, NY, USA
² University of Wisconsin-Madison, Madison, USA
³ University of Texas at Arlington, Arlington, TX, USA
⁴ University of Pittsburgh, Pittsburgh, PA, USA

^* e-mail: wguan2@bnl.gov

Abstract

Machine Learning (ML) has become an important tool for High Energy Physics analysis. As datasets at the Large Hadron Collider (LHC) grow, and as search spaces expand to exploit the full physics potential, these ML tasks require ever more computing resources. In addition, advanced ML workflows are being developed in which one task may depend on the results of previous tasks. How to use the vast distributed CPUs and GPUs of the Worldwide LHC Computing Grid (WLCG) for these large, complex ML tasks has become a popular research area. In this paper, we present our efforts to enable the execution of distributed ML workflows on the Production and Distributed Analysis (PanDA) system and the intelligent Data Delivery Service (iDDS). First, we describe how PanDA and iDDS handle large-scale ML workflows, including the implementation that processes workloads on diverse and geographically distributed computing resources. Next, we report real-world use cases, such as hyperparameter optimization, Monte Carlo toy confidence limit calculation, and active learning. Finally, we conclude with future plans.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
