| Issue | EPJ Web Conf., Volume 337, 2025, 27th International Conference on Computing in High Energy and Nuclear Physics (CHEP 2024) |
|---|---|
| Article Number | 01172 |
| Number of page(s) | 6 |
| DOI | https://doi.org/10.1051/epjconf/202533701172 |
| Published online | 07 October 2025 |
Efficiency, Reproducibility, and Portability in HEP Machine Learning Training - ML Training Facility at Vanderbilt University
Vanderbilt University, Nashville, Tennessee, USA
* e-mail: jethrogaglione@gmail.com
** e-mail: andrew.m.melo@vanderbilt.edu
*** e-mail: satkardhakal@gmail.com
**** e-mail: pkoiralap@gmail.com
The success and adoption of machine learning (ML) approaches to solving HEP problems have been widespread and fast. As useful a tool as ML has been in the field, the growing number of applications, larger datasets, and increasing complexity of models create a demand for both more capable hardware infrastructure and cleaner methods of reproducibility and deployment. We have developed a prototype ML Training Facility (MLTF) with the goal of meeting these demands. The proof-of-concept MLTF is based at ACCRE, Vanderbilt’s computing cluster, with sufficient GPU storage and networking to efficiently test very large models. The software component of MLTF is developed with an eye on reproducibility and portability. We adapt MLflow as an end-to-end ML solution for its capabilities as a user-friendly job submission interface; as a tracking server for model and run details, arbitrary metrics logging, and system diagnostics logging; and as an inference server.
© The Authors, published by EDP Sciences, 2025
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

