| Issue | EPJ Web Conf., Volume 337, 2025: 27th International Conference on Computing in High Energy and Nuclear Physics (CHEP 2024) |
|---|---|
| Article Number | 01169 |
| Number of page(s) | 8 |
| DOI | https://doi.org/10.1051/epjconf/202533701169 |
| Published online | 07 October 2025 |
Reading Tea Leaves - Understanding internal events and addressing performance issues within a CephFS/XRootD Storage Element
Lancaster University, Lancaster, UK
* e-mail: m.doidge@lancaster.ac.uk
** e-mail: g.hand@lancaster.ac.uk
*** e-mail: p.love@lancaster.ac.uk
**** e-mail: s.simpson@lancaster.ac.uk
Erasure-coded storage systems based on Ceph have become a mainstay within UK Grid sites as a means of providing bulk data storage whilst maintaining a good balance between data safety and space efficiency. These storage systems are complex and self-correcting, but despite access to a myriad of metrics, the inner workings of the storage tend to be opaque to the storage admin. One of the common problems seen within Ceph-based systems is slow ops: operations that take longer than expected and are often blocking in nature, impacting the overall performance and reliability of the system. Identifying the causes of slow ops can help to prevent or reduce the impact of future occurrences, leading to an increase in performance and reliability.
We detail the efforts of the Lancaster Grid Site to understand the causes of these slow ops and to mitigate them, along with other performance bottlenecks within our storage system. We endeavour to bring together a holistic monitoring model, utilising Ceph metrics, detailed XRootD monitoring streams, and client-side logging, in order to understand how data-management events impact the health of the storage.
© The Authors, published by EDP Sciences, 2025
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

