Producing Madgraph5_aMC@NLO gridpacks and using TensorFlow GPU resources in the CMS HTCondor Global Pool

The CMS experiment has an HTCondor Global Pool, composed of more than 200K CPU cores available for Monte Carlo production and the analysis of data. The submission of user jobs to this pool is handled either by CRAB, the standard workflow management tool used by CMS users to submit analysis jobs requiring event processing of large amounts of data, or by CMS Connect, a service focused on final-stage condor-like analysis jobs and applications that already have a workflow job manager in place. The latter scenario can bring cases in which workflows need further adjustments in order to work efficiently in a globally distributed pool of resources. For instance, the generation of matrix elements for high energy physics processes via Madgraph5_aMC@NLO and the usage of tools not (yet) fully supported by the CMS software, such as TensorFlow with GPU support, are tasks with particular requirements. A special adaptation, either at the pool factory level (advertising GPU resources) or at the execute level (e.g., to handle special parameters that describe certain needs of the remote execute nodes during submission), is needed in order to work adequately in the CMS Global Pool. This contribution describes the challenges and efforts involved in adapting such workflows so they can properly profit from the Global Pool via CMS Connect.


The submission system
While submission of CMS [1] user jobs to the Global Pool [2] is mostly managed by CRAB [3], the standard analysis workflow management tool, the generation of matrix elements for high energy physics processes via Madgraph5_aMC@NLO [4] and the usage of machine learning tools with GPU resources are independent use-cases that require special adaptation in order to take advantage of the Global Pool resources.
CMS Connect [5] provides a service where users can submit HTCondor jobs to the CMS Global Pool (a global HTCondor pool provisioned by GlideinWMS) with a submission interface similar to those provided by analysis facilities physicists are familiar with, such as the CERN Analysis Facility [6]. This service complements CRAB, as illustrated in Figure 1, dealing with a different set of analysis workflows, such as Madgraph gridpacks and the use of GPU resources with TensorFlow [7] jobs. The sections below describe the challenges and efforts performed towards adapting these two different workflow types in order to properly work with CMS Connect and the Global Pool resources.
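To give a flavor of this condor-like interface, the following is a minimal sketch of how a user job might be described and queued with the HTCondor Python bindings; the script and file names are hypothetical placeholders, and the additional attributes a real CMS Connect job carries are not shown here.

```python
import htcondor

# A simple vanilla-universe job description, as a CMS Connect user would
# write in a submit file. Executable and input names are hypothetical.
submit = htcondor.Submit({
    "universe": "vanilla",
    "executable": "run_analysis.sh",   # placeholder user script
    "arguments": "input.root",
    "output": "job_$(ProcId).out",
    "error": "job_$(ProcId).err",
    "log": "job.log",
    "request_cpus": "1",
    "request_memory": "2000",          # MB
})

# Queue one instance of the job in the local schedd.
schedd = htcondor.Schedd()
with schedd.transaction() as txn:
    submit.queue(txn, count=1)
```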

Generating Madgraph5_aMC@NLO gridpacks
Monte Carlo (MC) event generators, such as Madgraph5_aMC@NLO, are used to model physics processes in the high energy physics field. The generated information, including, for example, the computed differential cross sections and the final-state particles involved in these processes, is stored in a compressed tarball package called a gridpack. This is one of the very first steps in the simulation chain that produces the MC samples used in physics analyses, as shown in Figure 2.
A generators package that automates the production of these gridpacks, by setting up the CMS software environment and providing Madgraph5_aMC@NLO, is used in the experiment. From a computational point of view, this can be done in two different ways: by using all cores available on a single machine, or by having Madgraph5_aMC@NLO create and submit multiple jobs to a batch manager (e.g., HTCondor [8]).
The second method is preferred for complex processes, due to their high demand of CPU power. Local resources, such as the CERN Analysis Facility or local Tier-3 sites, where users have login access to the batch submission system (in contrast to grid-enabled resources, where grid middleware manages the submission to the batch system), can in principle be used for this purpose. However, not all CMS users have access to the same local resources, and the submission method varies with the batch manager available. In addition, long-running jobs might have special requirements, such as periodically renewing AFS tokens. The submission of jobs to the Global Pool offers several advantages:
• Higher computing power, distributed across all grid site resources available in the CMS Global Pool.
• Better accounting and monitoring of jobs.
• A central submission node for all CMS users with a grid proxy certificate registered in the CMS Virtual Organization.
• A single batch submission manager (HTCondor) to deal with.
However, the Global Pool infrastructure expects certain parameters characterizing the job that are not set by default, such as the estimated maximum wall time of the executable, or the list of CMS sites the jobs may run at. Additionally, jobs that could not finish running because of an error requiring further action are put in the "hold" state in the system (meaning these jobs will not match any resource until they are released). Madgraph5_aMC@NLO treats held jobs as a general failure and aborts the whole submission, but transient errors leading to held jobs are not uncommon when submitting to several different sites globally. The cluster manager in Madgraph5_aMC@NLO was therefore adapted to account for these factors.
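As an illustration, the sketch below shows how such parameters can be attached to a job with the HTCondor Python bindings. The attribute names MaxWallTimeMins and DESIRED_Sites follow conventions commonly used in the CMS Global Pool, but the concrete values and the executable name are placeholders.

```python
import htcondor

# Hypothetical gridpack job description carrying the extra attributes
# expected by the Global Pool infrastructure.
submit = htcondor.Submit({
    "executable": "ajob0",           # placeholder Madgraph job script
    "log": "condor.log",
    # Attributes prefixed with "+" are inserted verbatim into the job
    # classad, where the pool infrastructure can read them.
    "+MaxWallTimeMins": "480",       # estimated maximum wall time
    "+DESIRED_Sites": '"T2_US_Nebraska,T2_CH_CERN"',  # example site list
})
```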
For instance, a dynamic adjustment of the requested maximum wall time per job is performed, and the remote sites for submission can be specified through environment variables in the system (while in most cases matching to all sites is desired, selecting particular sites can be especially useful, e.g., to exclude sites known to have transient issues at the time, or to test the submission of jobs with special dependencies not yet distributed via the CernVM File System [9] to a specific site). The cluster manager in Madgraph5_aMC@NLO was modified to use the HTCondor Python bindings to check the status of the jobs and to release held jobs with common transient errors for retrial. Also, the environment on the worker nodes was adjusted to propagate library dependency paths that are lost when using Singularity [10] containers (the default behavior for remote resources in the Global Pool). Figure 3 shows a diagram with the changes described above. Additionally, gridpack jobs set special HTCondor classads that are later used to track the activity of each gridpack on CMS monitoring dashboards. For example, Figure 4 shows the gridpack activity in the Global Pool divided by name. The name of each gridpack was stored as an HTCondor classad that is used on the monitoring side to make this classification.
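A minimal sketch of this held-job handling with the HTCondor Python bindings is shown below; the list of hold reasons considered transient is a hypothetical example, not the actual list used in the Madgraph5_aMC@NLO cluster manager.

```python
import htcondor

# Hold reasons treated as transient and worth retrying (illustrative only).
TRANSIENT_PATTERNS = ("Transfer input files failure", "Failed to initialize")

schedd = htcondor.Schedd()

# JobStatus == 5 selects held jobs; fetch their hold reason for inspection.
held = schedd.query("JobStatus == 5", ["ClusterId", "ProcId", "HoldReason"])

for ad in held:
    reason = ad.get("HoldReason", "")
    if any(p in reason for p in TRANSIENT_PATTERNS):
        # Release the job so it can match resources again and be retried,
        # instead of aborting the whole gridpack submission.
        schedd.act(
            htcondor.JobAction.Release,
            "ClusterId == %d && ProcId == %d" % (ad["ClusterId"], ad["ProcId"]),
        )
```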

Deep learning and GPU resources
Machine Learning algorithms, such as boosted decision trees, random forests or artificial neural networks, have been successfully used within the high energy physics field for decades, but the rise in demand for GPU resources started only a few years ago, with the training of deep neural networks, a subset of Machine Learning inspired by artificial neural networks (see Figure 5). The availability of thousands of cores and a higher memory bandwidth than conventional CPU resources are among the main features of GPUs. On the other hand, the lower memory, lower clock speeds and the fact that data has to be transferred to the GPU card make it challenging for several applications to take advantage of them. However, deep learning algorithms involve matrix multiplications and other operations that can be massively parallelized and are not tied to high memory requirements or large data transfers, making them well suited to the GPU architecture.
The usage of deep learning algorithms in industry has led to the development of powerful machine learning frameworks. For instance, TensorFlow [7] is an open source software library, originally developed by Google, providing strong support for machine learning and deep learning, with APIs for several programming languages, such as Python and C++, two popular languages in the high energy physics community.
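As a simple illustration of the kind of operation that benefits from a GPU, the sketch below, written against the TensorFlow 1.x Python API current at the time, pins a large matrix multiplication to the first GPU; the matrix sizes are arbitrary.

```python
import tensorflow as tf  # TensorFlow 1.x API

# Build a large matrix multiplication and place it explicitly on the GPU.
with tf.device("/device:GPU:0"):
    a = tf.random_normal([4096, 4096])
    b = tf.random_normal([4096, 4096])
    c = tf.matmul(a, b)

# log_device_placement prints the device each operation actually ran on,
# confirming that the matmul executed on the GPU.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(c)
```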

Using TensorFlow and GPU resources in the Global Pool
Even though many grid sites with GPU resources are available in the CMS Global Pool, meeting the software dependencies needed to run deep learning algorithms on them can be a challenge, due to the lack of support for TensorFlow and other related frameworks on the base operating systems commonly used by CMS (Red Hat 6 and 7).
To help with this, CMS provides such dependencies through CVMFS, but only with CPU support. Integrating GPU resources can easily lead to conflicts between GPU library dependencies: for instance, different TensorFlow versions can require specific versions of cuDNN (the NVIDIA Deep Learning SDK) or of the CUDA [12] toolkit.
To overcome this issue on a wider scale, Singularity containers [10,13] based on Ubuntu with TensorFlow installed with GPU support are built, maintained and distributed via CVMFS by the Open Science Grid (OSG) [14,15]. Figure 6 illustrates the different components involved in the provisioning of such software dependencies handled by the OSG. These Singularity images are used with the CMS resources in a transparent way, due to the full support for Singularity in the CMS Global Pool infrastructure.
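A rough sketch of how such a GPU job might be described is shown below; request_gpus is a standard HTCondor resource request, while the SingularityImage attribute and the CVMFS image path follow common OSG conventions and should be checked against the current documentation. The user script name is a placeholder.

```python
import htcondor

# Hypothetical TensorFlow training job requesting one GPU and an
# OSG-provided Singularity image with GPU-enabled TensorFlow.
submit = htcondor.Submit({
    "executable": "train.sh",    # placeholder user script
    "log": "gpu_job.log",
    "request_gpus": "1",         # ask the pool for a GPU slot
    "request_memory": "4000",    # MB
    # Custom classad pointing to the image distributed via CVMFS.
    "+SingularityImage":
        '"/cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:latest"',
})
```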

Conclusions
This work has presented the needs and challenges of two different types of workflows existing in the CMS collaboration, as well as the solutions provided in order to make them compatible with Global Pool resources via CMS Connect. While the production of gridpacks required adaptations at the Madgraph5_aMC@NLO code level, in order to set the job parameters expected by the Global Pool infrastructure and to handle jobs held by transient errors, the usage of TensorFlow with GPU resources works in a transparent way, thanks to the GPU-enabled Singularity images built and distributed via CVMFS by the OSG.