Using Big Data Technologies for HEP Analysis

Matteo Cremonesi; Claudio Bellini; Bianny Bian; Luca Canali; Vasileios Dimakopoulos; Peter Elmer; Ian Fisk; Maria Girone; Oliver Gutsche; Siew-Yan Hoh; Bo Jayatilaka; Viktor Khristenko; Andrea Luiselli; Andrew Melo; Evangelos Motesnitsalis; Dominick Olivito; Jacopo Pazzini; Jim Pivarski; Alexey Svyatkovskiy; Marco Zanetti

doi:10.1051/epjconf/201921406030

Open Access

Issue		EPJ Web Conf. Volume 214, 2019: 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP 2018)


Article Number		06030
Number of page(s)		8
Section		T6 - Machine learning & analysis
DOI		https://doi.org/10.1051/epjconf/201921406030
Published online		17 September 2019

EPJ Web of Conferences 214, 06030 (2019)
https://doi.org/10.1051/epjconf/201921406030

Matteo Cremonesi²,*, Claudio Bellini⁴, Bianny Bian⁴, Luca Canali¹, Vasileios Dimakopoulos¹, Peter Elmer⁵, Ian Fisk³, Maria Girone¹, Oliver Gutsche², Siew-Yan Hoh⁷, Bo Jayatilaka², Viktor Khristenko¹, Andrea Luiselli⁴, Andrew Melo⁶, Evangelos Motesnitsalis¹, Dominick Olivito⁸, Jacopo Pazzini⁷, Jim Pivarski⁵, Alexey Svyatkovskiy⁵ and Marco Zanetti⁷

¹ European Organization for Nuclear Research CERN, Geneva, Switzerland
² Fermi National Accelerator Laboratory, Batavia, IL, USA
³ Flatiron Institute of the Simons Foundation, New York, NY, USA
⁴ Intel Corporation, Santa Clara, USA
⁵ Princeton University, Princeton, NJ, USA
⁶ Vanderbilt University, Nashville, TN, USA
⁷ University of Padova, Padova, Italy
⁸ University of California San Diego, La Jolla, USA

Abstract

The HEP community is approaching an era where the excellent performance of particle accelerators in delivering collisions at high rates will force the experiments to record a large amount of information. The growing size of the datasets could become a limiting factor in the ability to produce scientific results in a timely and efficient manner. Recently, new technologies and approaches have been developed in industry to meet the need to retrieve information as quickly as possible when analyzing PB- and EB-scale datasets. Providing scientists with these modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother.

In this paper, we present the latest developments and the most recent results on the use of Apache Spark for HEP analysis. The study aims to evaluate the efficiency of applying the new tools both quantitatively, by measuring performance, and qualitatively, by focusing on the user experience. The first goal is pursued by developing a data reduction facility: working together with CERN Openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data collected by the CMS experiment to 1 TB of data in a format suitable for physics analysis, within 5 hours.

The second goal is pursued by implementing multiple physics use cases in Apache Spark, using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-analyses up to the publication plots on different hardware, the feasibility, usability and portability of this approach are compared with those of a traditional ROOT-based workflow.
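
To put the data reduction target in perspective, the following back-of-the-envelope sketch computes the reduction factor and the sustained read rate implied by processing 1 PB in 5 hours. It is not taken from the paper: decimal (SI) units are assumed, and the variable names are illustrative only.

```python
# Rough arithmetic for the stated target: 1 PB in, 1 TB out, 5 hours.
# Assumption (not from the paper): SI units, 1 PB = 10**15 bytes.
PB = 10**15
TB = 10**12
GB = 10**9

input_bytes = 1 * PB          # size of the input dataset
output_bytes = 1 * TB         # size of the reduced dataset
wall_time_s = 5 * 3600        # 5 hours expressed in seconds

# How much smaller the output is than the input.
reduction_factor = input_bytes / output_bytes

# Average rate at which the input must be read to finish in time.
read_throughput_gb_s = input_bytes / wall_time_s / GB

print(f"reduction factor: {reduction_factor:.0f}x")
print(f"sustained read throughput: {read_throughput_gb_s:.1f} GB/s")
```

Under these assumptions the target corresponds to a thousandfold reduction and an aggregate read rate of roughly 56 GB/s, averaged over the whole run.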

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
