The CMS monitoring applications for LHC Run 3

Brij Kishor Jashal; Valentin Kuznetsov; Federica Legger; Ceyhun Uzunoglu

doi:10.1051/epjconf/202429504045

EPJ Web of Conf., 295 (2024) 04045

Issue		EPJ Web of Conf. Volume 295, 2024 26th International Conference on Computing in High Energy and Nuclear Physics (CHEP 2023)


Article Number		04045
Number of page(s)		7
Section		Distributed Computing
DOI		https://doi.org/10.1051/epjconf/202429504045
Published online		06 May 2024

EPJ Web of Conferences 295, 04045 (2024)

Brij Kishor Jashal¹, Valentin Kuznetsov², Federica Legger³* and Ceyhun Uzunoglu⁴ on behalf of the CMS Collaboration

¹ Cornell University, Ithaca (NY), USA
² Tata Institute of Fundamental Research, Mumbai, India
³ Istituto Nazionale di Fisica Nucleare, Torino, Italy
⁴ CERN, Geneva, Switzerland

* e-mail: federica.legger@cern.ch

Abstract

Data taking at the Large Hadron Collider (LHC) at CERN restarted in 2022. The CMS experiment relies on a distributed computing infrastructure based on the Worldwide LHC Computing Grid (WLCG) to support the LHC Run 3 physics program. The CMS computing infrastructure is highly heterogeneous and relies on a set of centrally provided services, such as distributed workload management and data management, and on computing resources hosted at almost 150 sites worldwide. Smooth data taking and processing require all computing subsystems to be fully operational, and the available computing and storage resources need to be continuously monitored. During the long shutdown between LHC Run 2 and Run 3, the CMS monitoring infrastructure underwent major changes to increase the coverage of monitored applications and services while becoming more sustainable and easier to operate and maintain. The technologies used are based on open-source solutions, either provided by the CERN IT department through the MONIT infrastructure or managed by the CMS monitoring team. Monitoring applications for distributed workload management, the HTCondor-based submission infrastructure, distributed data management, and facilities have been ported from mostly custom-built applications to common data flow and visualization services. Data are mostly stored in non-SQL databases and storage technologies such as ElasticSearch, VictoriaMetrics, Prometheus, InfluxDB and HDFS, and are either accessed via programmatic APIs, Apache Spark or Sqoop jobs, or visualized, preferably with Grafana. Most CMS monitoring applications are deployed on Kubernetes clusters to minimize maintenance operations. In this contribution we present the full stack of CMS monitoring services and show how we leveraged common technologies to cover a variety of monitoring applications and cope with the computing challenges of LHC Run 3.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
