Issue | EPJ Web of Conf., Volume 295, 2024: 26th International Conference on Computing in High Energy and Nuclear Physics (CHEP 2023)
---|---
Article Number | 01015
Number of page(s) | 8
Section | Data and Metadata Organization, Management and Access
DOI | https://doi.org/10.1051/epjconf/202429501015
Published online | 06 May 2024
Data Popularity for Cache Eviction Algorithms using Random Forests
1 CERN, IT Department, Geneva, Switzerland
2 Inria, Côte d’Azur University, Sophia Antipolis, France
* e-mail: olga.chuchuk@cern.ch
** e-mail: markus.schulz@cern.ch
In the HEP community, the prediction of data popularity has been studied for many years. Nonetheless, as data storage challenges grow, especially in the upcoming HL-LHC era, better predictive models are still needed to decide whether particular data should be kept, replicated, or deleted.
Caches have proven to be a convenient technique for partially automating storage management, potentially eliminating some of these questions. On the one hand, even simple cache eviction policies like LRU bring a benefit; on the other hand, we show that incorporating knowledge about future access patterns has the potential to greatly improve cache performance.
In this paper, we study data popularity at the file level, where the special relation between files belonging to the same dataset can be used in addition to the standard attributes. We turn to Machine Learning algorithms such as Random Forest, which is well suited to Big Data: it can be parallelized, and it is more lightweight and easier to interpret than Deep Neural Networks. Finally, we compare the results with standard cache eviction algorithms and the theoretical optimum.
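To illustrate the gap between a simple eviction policy and one that knows future accesses, the following is a minimal sketch and not the authors' implementation. It compares LRU with Belady's MIN policy (evict the item whose next access lies farthest in the future), which is used here as a stand-in for the "theoretical optimum"; the abstract does not name the algorithm. The access trace, the cache capacity, and the function names `lru_hits` and `optimal_hits` are hypothetical, and all files are assumed to have the same size.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count cache hits for a Least Recently Used (LRU) cache."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for item in trace:
        if item in cache:
            hits += 1
            cache.move_to_end(item)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used item
            cache[item] = True
    return hits


def optimal_hits(trace, capacity):
    """Count cache hits for Belady's MIN policy (requires the full future trace)."""
    # next_use[i] is the index of the next access to trace[i] after i (inf if none).
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # item -> index of its next access
    hits = 0
    for i, item in enumerate(trace):
        if item in cache:
            hits += 1
        elif len(cache) >= capacity:
            # Evict the cached item whose next access is farthest in the future.
            victim = max(cache, key=cache.get)
            del cache[victim]
        cache[item] = next_use[i]  # insert or refresh the next-access index
    return hits


# Hypothetical toy trace: a cyclic scan over four files with room for three.
trace = ["a", "b", "c", "d"] * 3
print("LRU hits:    ", lru_hits(trace, 3))      # 0
print("Optimal hits:", optimal_hits(trace, 3))  # 6
```

On this cyclic trace LRU always evicts the file that is needed next and scores no hits, while the policy with future knowledge reaches 6 hits out of 12 accesses. A learned popularity model would sit between these two extremes, replacing the exact `next_use` values with predictions.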
© The Authors, published by EDP Sciences, 2024
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.