New developments in cost modeling for the LHC computing

The increase in the scale of LHC computing during Run 3 and Run 4 (HL-LHC) will certainly require radical changes to the computing models and the data processing of the LHC experiments. The working group established by WLCG and the HEP Software Foundation to investigate all aspects of the cost of computing, and how to optimise it, has continued producing results and improving our understanding of this process. In particular, the experiments have developed more sophisticated ways to calculate their resource needs, and we have a much more detailed process to calculate infrastructure costs. This includes studies on the impact of HPC and GPU-based resources on meeting the computing demands. We have also developed and perfected tools to quantitatively study the performance of experiment workloads, and we are actively collaborating with other activities related to data access, benchmarking and technology cost evolution.


Introduction
The preparation for LHC Run 3, which will see considerable changes in data collection and processing for ALICE and LHCb, and for Run 4 (HL-LHC), has made significant progress in the last year; many factors have contributed to narrowing the estimated gap between the available and the needed processing power and storage. In the previous report [1] we quoted an O(10) discrepancy, while now it is closer to a factor of 2-3 [2]. The "revolutionary" changes we mentioned as being required to completely close the gap are progressively being introduced or planned.
In addition, thanks to several refinements, the calculation of the cost of computing has improved, both in terms of resources for what is required for the physics program, and in terms of infrastructure costs.
The System Performance and Cost Modeling Working Group, created in 2017 and comprising around thirty members from experiments, sites, and IT and software experts, has continued along the roadmap initially defined and has also started some new activities, such as in the area of data access efficiency, in close collaboration with the DOMA access group [3], and in contributing to the definition of new benchmarks, together with the HEPiX benchmarking working group [4].
In this contribution we will show some recent developments in the areas of work under the domain of this activity.

Software performance
Characterization of software performance is a complex problem that can be approached from different points of view. While software developers are most concerned with understanding which parts of the code need optimizing, for the computing infrastructure manager it is mainly a question of measuring what is needed to run the experiment workloads effectively. In this case, the application software is to be considered, to a first approximation, as a "black box", and tools like PrMon (which relies on the Linux kernel to extract CPU time, memory, I/O and network metrics for a given process tree) [5] or Trident (which gives access to detailed information on CPU utilization at the node level using hardware counters) [6] are extremely effective in producing metrics that can be used for infrastructure planning, for benchmarking and for understanding inefficiencies. As an example, Trident was used to quantify how similar, or dissimilar, the experiment workloads are to a given benchmark application in how they use a CPU [4].
A set of reference workloads from each LHC experiment was analyzed with PrMon, and the resulting values for the metrics are summarized in table 1. The metrics include: number of threads or processes (N_thr/proc), CPU efficiency (ε_CPU), time per event (T_evt), memory per core (M_c), read rate per core (R_c) and write rate per core (W_c).
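These summary metrics can be derived from the raw quantities a PrMon-like monitor reports for a job. The sketch below is illustrative only: the function name, the job numbers and the exact metric definitions are my reading of the table above, not PrMon's actual output schema.

```python
def per_event_metrics(cpu_time_s, wall_time_s, n_cores, n_events,
                      rss_bytes, read_bytes, write_bytes):
    """Derive per-job summary metrics from raw monitoring counters.
    Definitions follow a plausible reading of table 1 and may differ
    in detail from the ones used in the paper."""
    return {
        "eps_cpu": cpu_time_s / (wall_time_s * n_cores),  # CPU efficiency
        "t_evt": wall_time_s / n_events,                  # wall time per event
        "m_core": rss_bytes / n_cores,                    # memory per core
        "r_core": read_bytes / (wall_time_s * n_cores),   # read rate per core
        "w_core": write_bytes / (wall_time_s * n_cores),  # write rate per core
    }

# Hypothetical 8-core job: 7200 s wall clock, 51840 s total CPU time,
# 1000 events, 16 GiB resident memory, 80 GiB read, 8 GiB written
m = per_event_metrics(51840.0, 7200.0, 8, 1000,
                      16 * 2**30, 80 * 2**30, 8 * 2**30)
print(m["eps_cpu"], m["t_evt"])
```

A job fully using all its cores would have ε_CPU close to 1; the hypothetical job above reaches 0.9, a typical value for well-behaved multi-process workloads.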
As the full PrMon output consists of time series for each metric, we looked into ways to parametrize the time series using a minimal set of parameters. A technique based on CPOP (Continuous piecewise linear Pruned Optimal Partitioning) [7] is able to detect changepoints and therefore reduce a time series to a very small number of points (figure 1). Another work looked into the effect of varying limitations of system resources (memory, network bandwidth and network latency) on the reference workloads, in particular on their wall-clock time. Studies on the effect of compiler versions and optimizations were done using Geant4 simulation, which showed that statically compiling libraries may achieve a 10% speedup with respect to dynamically compiled libraries, and that switching from gcc 4.8.5 to 8.2.0 resulted in a 30% speedup [8]. Consistent results were obtained for CMS simulation [9].
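The idea of reducing a monitored time series to a handful of characteristic points can be illustrated with a simple piecewise-linear simplification. The sketch below uses the classic Ramer-Douglas-Peucker algorithm rather than CPOP (which instead optimizes a penalized cost over all possible partitions), but it conveys the same intuition on synthetic data:

```python
def rdp(points, epsilon):
    """Reduce a series of (t, value) points to a piecewise-linear
    approximation: keep a point only if it deviates more than
    epsilon from the chord joining the segment endpoints."""
    if len(points) < 3:
        return list(points)
    (t0, v0), (t1, v1) = points[0], points[-1]

    def dev(p):
        # vertical deviation from the chord between the endpoints
        t, v = p
        return abs(v - (v0 + (v1 - v0) * (t - t0) / (t1 - t0)))

    idx = max(range(1, len(points) - 1), key=lambda i: dev(points[i]))
    if dev(points[idx]) <= epsilon:
        return [points[0], points[-1]]
    # split at the worst point and recurse on both halves
    left = rdp(points[:idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return left[:-1] + right

# Synthetic memory-usage curve: ramp-up, plateau, then release
series = ([(t, 0.5 * t) for t in range(10)]
          + [(t, 4.5) for t in range(10, 20)]
          + [(t, 4.5 - 0.9 * (t - 20)) for t in range(20, 25)])
reduced = rdp(series, epsilon=0.1)
print(len(series), "->", len(reduced))  # 25 points reduced to 4
```

The 25-point series collapses to its four corner points, which is essentially what figure 1 shows for a real PrMon memory trace.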

Resource estimation
An important step in the process of calculating the cost of computing is to estimate the amount of resources (CPU, disk, tape and network) needed to fulfill the physics programme of the experiment. Having a sufficiently complex and flexible model is a prerequisite to produce reliable estimates, and at some point it was proposed to develop a common framework that could be used by all LHC experiments. Although such a framework never came to be, the experiments are now in a much better situation, having modular software-based frameworks instead of the unwieldy spreadsheets that were used for many years. For example, ATLAS and CMS now have frameworks that are very much comparable in terms of functionality and parameters.
Recent work in CMS [10] focused on adding, for the first time, estimates for the required tape I/O bandwidth at the HL-LHC and for the network capacity; significant uncertainties remain, though, such as the future role of GPUs and accelerators, which is impossible to quantify at this point in time.
Given the maturity reached by this estimation process, it is reasonable to assume that the cost model working group will no longer need to play a role beyond facilitating the exchange of information among experiments and identifying possible gaps that would need to be addressed.

Site cost estimation
In order to estimate the feasibility of possible future LHC computing models, some understanding of computing resource costs at a global scale is necessary. This requires an analysis of what computing expenses are today and what they are likely to be in the coming years, based on what we have observed over the last years. The main difficulty is that WLCG is composed of various academic data centres ("sites"), which provide different types of hardware resources and procure them at different prices, provide different levels of service, and belong to different nations with their own particularities (funding model, strategy, energy providers and so on). We tried to estimate and understand the origin of these heterogeneities and provide a set of indicators that make sense for all sites.

Computing resource costs
The first major step of this approach consisted of gathering relevant indicators related to computing resource and energy costs from the biggest data centers participating in WLCG. We first focused on the distributions of CPU, disk and tape cartridge costs for hardware procured in 2018, as well as on the local energy costs. Figure 2 presents the anonymized results and Table 2 summarizes the average values and deviations across sites. Note that, due to the special financial model of Site "F" tape storage, we removed its contribution from the average tape cartridge cost and its standard deviation.
While CPU prices are rather homogeneous across sites, the variance in storage costs is considerable. This effect is due to a larger diversity in the local storage technology market and, unlike for CPUs, the performance is not benchmarked and remains harder to compare across sites. Therefore, we believe it important to stress that, although the cost calculation rule is the same for all sites, the local context, market and technology choices are the most important contributors to the cost differences.
In addition to the 2018 purchase costs, we established the evolution of the purchase costs over the years, per site, although not all sites could provide relevant data. It is clear that the CPU price evolution is not as fast as expected: all sites observe a slowdown in the CPU price evolution trend, which remains below −20% per year in absolute value. Concerning disk storage, the picture is rather coherent and many sites observed a price evolution of about −15% per year. In the case of tape storage, it is difficult to draw any conclusion from the observed cost trends across sites: an important change in the tape and library market has been taking place recently, and every site is in a different situation; we will probably stay, on average, below −20% per year for some time.
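A sustained yearly price evolution translates into future unit costs as a simple compound factor. The numbers below are purely illustrative, not actual site data:

```python
def project_cost(cost_today, yearly_change, years):
    """Project a unit cost forward assuming a constant yearly
    relative change (e.g. -0.15 for a -15% per year trend)."""
    return cost_today * (1.0 + yearly_change) ** years

# Hypothetical disk price of 20 EUR/TB today, with a -15%/year trend:
for y in (1, 3, 5):
    print(y, "years:", round(project_cost(20.0, -0.15, y), 2), "EUR/TB")
```

Even such a simple extrapolation shows why the dispersion of trends across sites matters: a few percentage points of difference in the yearly change compound into large cost differences over an HL-LHC timescale.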
This set of results provides some insight into the current situation with respect to resource procurement expenses and their homogeneity. The trends observed over the last years allow us to make a rough guess at what prices may become in the next few years. However, due to the dispersion of the observed trends across sites, it does not seem reasonable to try to make predictions many years ahead.

Total Cost of Ownership
In this section we address the total cost of ownership (TCO) of a data center. There is no simple and standard way to calculate a data center TCO, so we use here two different methods, which we call "atomic" and "holistic", based on opposite approaches.
The atomic TCO approach is inspired by the CERN procurement model; it accounts for every component needed to build up a rack (servers, PDUs, switches, uplinks, routers, building etc.) in association with their respective expenses and lifetimes. The required energy expenses are included, and the human effort expenses to operate a service are also calculated, based on staff expertise and the estimated time needed to operate the service. The results are summarized in Figure 3. From the calculation, in both cases IT expenses contribute most of the atomic TCO, at a level of 70-90%.
The holistic TCO simply consists of taking the average yearly budget spent by a data center. By definition this includes every expense: those of the atomic TCO plus other sources of expenses, such as tape libraries, facility maintenance, developers, project managers, support, administrative staff etc. In this approach, major expenses that occur occasionally are distributed over the years according to their expected amortization period. The calculation of the holistic TCO results in a drastically different picture, where IT contributes about 30% of the budget, the facility about 20%, and half of the budget is invested in manpower.
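The atomic approach can be sketched as a sum of yearly amortized component costs. The component names and figures below are purely illustrative placeholders, not actual CERN procurement data:

```python
# Each component: (purchase cost in EUR, expected lifetime in years)
components = {
    "server":     (5000.0, 4),
    "switch":     (2000.0, 5),
    "rack + PDU": (1500.0, 10),
}
energy_per_year = 800.0    # EUR, electricity share for this rack slice
effort_per_year = 1200.0   # EUR, operations effort for the service

def atomic_tco_per_year(components, energy, effort):
    """Yearly 'atomic' TCO: amortize each hardware component over
    its lifetime, then add running energy and human-effort expenses."""
    hardware = sum(cost / lifetime for cost, lifetime in components.values())
    return hardware + energy + effort

print(atomic_tco_per_year(components, energy_per_year, effort_per_year))
```

With these placeholder numbers the hardware amortization dominates, which mirrors the 70-90% IT share found in the atomic calculation; the holistic calculation differs mainly by adding manpower and facility expenses that this sketch leaves out.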
The difference in the results is mainly explained by the fact that the two considered TCO calculations do not address the same aspects, especially concerning human effort. Whereas the holistic TCO provides an insight on the breakdown of expenses in one of the main WLCG data centers, including staff not directly related to the provision of the final service, the atomic TCO is probably more suited to compute marginal costs when providing an additional computing or disk-based storage service.

Other improvements
Very promising work has started to study the impact of storage caches on the cost of computing for smaller sites. The idea behind it is the consolidation of managed storage at a few large sites forming "data lakes", and its replacement with storage caches elsewhere. This would bring several advantages: lessening the need to create long-lived replicas of datasets (simpler data management), reducing the amount of "cold data" (less storage used), replacing complex storage services with simpler ones (lighter site operations) and allowing for less redundant and expensive storage configurations (lower cost per TB). This work uses actual file access data from ATLAS and CMS and is described in detail elsewhere in these proceedings [11].

Cost effectiveness of HPC resources
An aspect of the computing evolution that has not yet been completely understood is the scenario, particularly likely in certain regions, where computing resources at the national level are increasingly provided by supercomputing (HPC) centers to the detriment of more "traditional" computing centers, like those operating as Tier-1s and Tier-2s in WLCG. Using HPC centers for the LHC experiment workloads presents considerable challenges, among which:
• hardware heterogeneity: non-x86 CPUs, GPUs, accelerators;
• different authorization/authentication systems, network restrictions, time-limited resource allocations, etc.;
• quantification of the usable resources with respect to each relevant workload.
The last point in particular was the subject of a preliminary proposal by the cost model and the HEPiX benchmarking working groups to define a practical procedure to map the capacity of an HPC resource to a WLCG "pledge", briefly described here.
The first step consists of running, at a small scale, an eligible workload A_1 for a certain time on a certain number of cores and accelerators (not shared with other workloads) and measuring the CPU time T_CPU (core·s) and accelerator time T_acc (accelerator·s) needed to process one event. Subsequently, the same workload is run on a "traditional" system and the amount of HS06·s [12] needed to process one event is measured. By equating the computing resource usage in the two cases, one can translate a given WLCG pledge into a certain amount of processing time on a given HPC resource. In reality, one will also have to take into account bottleneck effects, for example when the accelerator is underutilized and the CPU is the limiting factor. It is important to note that this equivalence is in principle different for each workload, as each one may use the resources differently, and different for each HPC system, potentially leading to a very large number of combinations; in practice, we can expect only a handful of systems to be used with the few workloads that can use them most efficiently.
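The steps above amount to a simple proportionality, sketched below with hypothetical measurements (the function name and all numbers are illustrative; bottleneck corrections are deliberately ignored):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def pledge_to_hpc_time(pledge_hs06, hs06s_per_event,
                       cpu_s_per_event, acc_s_per_event):
    """Translate a WLCG pledge (in HS06, sustained for one year) into
    the HPC core time and accelerator time needed to process the same
    number of events of a given workload."""
    # Events the pledged capacity can process in one year
    events = pledge_hs06 * SECONDS_PER_YEAR / hs06s_per_event
    # Equate event counts to get the equivalent HPC resource usage
    return events * cpu_s_per_event, events * acc_s_per_event

# Hypothetical workload: 200 HS06·s/event on a traditional node,
# 8 core·s and 2 accelerator·s per event on the HPC system
cpu_time, acc_time = pledge_to_hpc_time(10_000, 200.0, 8.0, 2.0)
print(cpu_time / SECONDS_PER_YEAR, "core-years,",
      acc_time / SECONDS_PER_YEAR, "accelerator-years")
```

Because T_CPU, T_acc and the HS06·s cost are all per-workload and per-system quantities, the conversion factor must in principle be re-derived for each (workload, HPC system) pair, which is exactly the combinatorial concern raised above.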
Another important metric to be defined would be a measure of how "well" we are exploiting an HPC resource, which we could call its realized potential. A possibility is to use as reference the R_max FLOPS measured by LINPACK to express the maximum achievable performance of the HPC system; integrating it over the length of the time allocation produces a certain amount of FLOP for the CPUs and the GPUs. While the workload runs, the CPU and GPU utilization levels are measured (as percentages) and multiplied by the amounts of FLOP calculated above, and the total FLOP utilized is divided by the total FLOP achievable. This ratio would represent how much of the computing power of the HPC resource is actually used, and it could be used to determine which workloads are best run on the resource, and which resources are most cost-effective for the pledge provider.
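The proposed ratio can be written down directly. The sketch below assumes constant utilization levels over the allocation and splits R_max into a CPU and a GPU part; the function name and the system figures are hypothetical:

```python
def realized_potential(rmax_cpu_flops, rmax_gpu_flops,
                       allocation_s, cpu_util, gpu_util):
    """'Realized potential' of an HPC allocation: FLOP actually
    exercised (utilization x peak rate x time) divided by the FLOP
    the allocation could deliver at LINPACK R_max."""
    achievable = (rmax_cpu_flops + rmax_gpu_flops) * allocation_s
    used = (cpu_util * rmax_cpu_flops
            + gpu_util * rmax_gpu_flops) * allocation_s
    return used / achievable

# Hypothetical system: 1 PFLOPS CPU peak, 9 PFLOPS GPU peak;
# the workload keeps CPUs 80% busy but GPUs only 20% busy
rp = realized_potential(1e15, 9e15, 3600.0, 0.80, 0.20)
print(rp)
```

Note how the GPU-heavy peak dominates the denominator: a workload that saturates the CPUs but barely uses the GPUs still realizes only a small fraction of the system's potential, which is precisely the bottleneck effect mentioned earlier.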
In reality, experience shows that running on an HPC resource usually requires a considerable amount of preparation, which is not taken into account in the metrics defined above. It is also clear that, at this time, very limited use (if any) can be made of non-CPU resources for production workloads; however, this is going to change as more HEP workloads become able to utilize GPUs.

Conclusions
After two years of activity, it is time to re-examine the roadmap and the goals of the working group. Many of its activities have matured and would be best conducted in other groups: DOMA access for the storage cost efficiency studies, and the HEP Software Foundation or the experiments for the performance analysis tools and the compiler optimization studies. Resource estimation falls solidly in the domain of the experiments, while site cost estimation is still ideally covered by the cost model working group.
Perhaps one of its biggest achievements was to raise awareness and stimulate the community to think, even more than before, about how to overcome the capacity gap that would make HL-LHC computing impossible if not addressed. This was achieved also by participating in computing schools and contributing to workshops related to the efficiency and performance of HEP software.
Therefore it seems appropriate to re-scope the working group by concentrating on the topics unique to it (site cost calculation, analysis of cost differences) and running topical workshops with experiments and sites rather than holding regular meetings.