Apache Spark usage and deployment models for scientific computing

Diogo Castro; Prasanth Kothuri; Piotr Mrowczynski; Danilo Piparo; Enric Tejedor

doi:10.1051/epjconf/201921407020

Open Access

Issue: EPJ Web Conf., Volume 214, 2019. 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP 2018)
Article Number: 07020
Number of page(s): 10
Section: T7 - Clouds, virtualisation & containers
DOI: https://doi.org/10.1051/epjconf/201921407020
Published online: 17 September 2019

EPJ Web of Conferences 214, 07020 (2019)
https://doi.org/10.1051/epjconf/201921407020

Diogo Castro, Prasanth Kothuri, Piotr Mrowczynski, Danilo Piparo and Enric Tejedor

CERN, 1 Esplanade des Particules, Meyrin, Switzerland

Abstract

This talk shares our recent experience in providing a data analytics platform based on Apache Spark for High Energy Physics, the CERN accelerator logging system and infrastructure monitoring. The Hadoop Service has started to expand its user base to researchers who want to perform analysis with big data technologies. Among the many frameworks, Apache Spark is currently getting the most traction from various user communities, and new ways to deploy Spark, such as Apache Mesos or Spark on Kubernetes, have started to evolve rapidly. Meanwhile, notebook web applications such as Jupyter offer the ability to perform interactive data analytics and visualizations without the need to install additional software. CERN already provides a web platform, called SWAN (Service for Web-based ANalysis), where users can write and run their analyses in the form of notebooks, seamlessly accessing the data and software they need. The first part of the presentation covers several recent integrations and optimizations to the Apache Spark computing platform that enable HEP data processing and CERN accelerator logging system analytics. These optimizations and integrations include, but are not limited to, access to kerberized resources, an XRootD connector enabling remote access to EOS storage, and integration with SWAN for interactive data analysis, thus forming a truly Unified Analytics Platform. The second part of the talk touches upon the evolution of the Apache Spark data analytics platform, in particular the recent work done to run Spark on Kubernetes on the virtualized and container-based infrastructure in OpenStack. This deployment model allows for elastic scaling of data analytics workloads, enabling efficient, on-demand utilization of resources in private or public clouds.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
