EPJ Web Conf.
Volume 251, 2021
25th International Conference on Computing in High Energy and Nuclear Physics (CHEP 2021)
Number of page(s): 10
Section: Distributed Computing, Data Management and Facilities
Published online: 23 August 2021
Anomaly detection in the CERN cloud infrastructure
1 CERN, Geneva, Switzerland
2 Nuclear Physics Institute of the Czech Academy of Sciences, Řež, Czech Republic
* e-mail: email@example.com
Anomaly detection in the CERN OpenStack cloud is a challenging task due to the large scale of the computing infrastructure and, consequently, the large volume of monitoring data to analyse. The current solution for spotting anomalous servers in the cloud infrastructure relies on a threshold-based alarming system, with thresholds carefully set by the system managers on the performance metrics of each infrastructure component. This contribution explores fully automated, unsupervised machine learning solutions for anomaly detection in time series metrics, adapting both traditional and deep learning approaches. The paper describes a novel end-to-end data analytics pipeline implemented to digest the large amount of monitoring data and to expose anomalies to the system managers. The pipeline relies solely on open-source tools and frameworks, such as Spark, Apache Airflow, Kubernetes, Grafana and Elasticsearch. In addition, an approach to building annotated datasets from the CERN cloud monitoring data is reported. Finally, the performance of a number of anomaly detection algorithms is given a preliminary evaluation using these annotated datasets.
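The contrast the abstract draws between threshold-based alarming and unsupervised detection can be illustrated with a minimal sketch. This is not one of the algorithms evaluated in the paper, which the abstract does not name. It assumes a single synthetic CPU-load series (the values, the 90% limit and the function names are all hypothetical) and uses a median/MAD modified z-score as a simple stand-in for an unsupervised detector.

```python
from statistics import median


def threshold_alarms(values, limit):
    """Indices where the metric exceeds a fixed, manually chosen limit."""
    return [i for i, v in enumerate(values) if v > limit]


def robust_zscore_anomalies(values, cutoff=3.5):
    """Indices whose modified z-score (median/MAD based) exceeds cutoff.

    No per-metric limit is configured: the scale is learned from the data.
    The 0.6745 factor and 3.5 cutoff are the commonly used defaults for the
    modified z-score, chosen here as an assumption.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # A constant series has no spread to compare against.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > cutoff
    ]


# Synthetic CPU load (%) for one hypothetical server; index 7 is unusual
# for this host but still far below a typical static alarm limit.
cpu_load = [22.0, 25.0, 21.0, 24.0, 23.0, 26.0, 22.0, 61.0, 24.0, 23.0]

static_hits = threshold_alarms(cpu_load, limit=90.0)
learned_hits = robust_zscore_anomalies(cpu_load)

print("static threshold:", static_hits)
print("modified z-score:", learned_hits)
```

In this toy series the fixed limit raises no alarm, while the data-driven score flags the one sample that departs from the host's usual behaviour. That gap is the motivation for automated approaches, although a production detector would need to handle trends, seasonality and multiple metrics.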
© The Authors, published by EDP Sciences, 2021
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.