Improving WLCG Networks Through Monitoring and Analytics

WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing. The OSG Networking Area, in partnership with WLCG, has focused on collecting, storing and making available all network-related metrics for further analysis and for the discovery of issues that might impact network performance and operations. To help sites and experiments better understand and fix networking issues, the WLCG Network Throughput working group was formed; it works on the analysis and integration of the network-related monitoring data collected by the OSG/WLCG infrastructure and operates a support unit that helps find and fix network performance issues. This paper describes the current state of the OSG network measurement platform and summarises the activities undertaken by the working group, including updates on recently developed higher-level services, network performance incidents investigated, and past and present analytical activities related to networking and their results.


Introduction
The Open Science Grid (OSG) and the Worldwide LHC Computing Grid (WLCG) have supported network monitoring activities since 2012, assisting their users and affiliates in improving overall network throughput by introducing active monitoring of the networks and the ability to test for and identify potential network performance bottlenecks [1,2]. Two important areas of development were establishing and operating a global network of measurement agents, and developing and operating a comprehensive network monitoring platform that collects and stores the measurements while making them available for further processing. This has been complemented by several activities that have improved our ability to manage and use the network topology and network metrics for analytics [3]. The WLCG Network Throughput Working Group was established in 2014 to help with some of the underlying tasks, such as overseeing the global network of measurement agents based on perfSONAR [4], establishing baseline measurements and performing low-level debugging activities. This has led to a dedicated network throughput support unit, which has successfully coordinated and resolved complex network performance incidents within LHCOPN and LHCONE [5].

Network Performance
Networks that connect sites and experiments need to handle ever-increasing amounts of data and convey it across multiple networks around the world. Due to the underlying complexity, end-to-end performance depends on a number of components and their operational status anywhere within the network. When a network under-performs or errors occur, it can be very difficult to identify and correct the source of the problem: local testing will often not find the cause, as errors can occur anywhere along the path the data takes as it moves between multiple networks. While outright connection failures are relatively easy to detect and fix, soft failures, where a network continues to function but with compromised performance, can be very hard to detect. Identifying such problems is best served by active end-to-end measurements against predefined targets, which in the scope of WLCG and OSG means a global network of agents testing all possible network paths end to end.

perfSONAR
Such a global network of agents has been established in collaboration with WLCG and OSG sites based on perfSONAR, a network measurement toolkit designed to provide federated coverage of paths and to help establish end-to-end usage expectations. It is open-source software developed by a consortium of ESnet, Internet2, Indiana University, University of Michigan and GEANT [4]. It provides a number of tools that can take various network measurements covering different aspects of network function, bundled in a comprehensive package that includes the tools, a scheduler, visualisation and centralised management functions such as configuration and discovery (see Fig. 1). The toolkit provides a range of standard metrics that give useful insight into the current state of the network. For latency and loss, apart from ping, it offers implementations of the one-way and two-way active measurement protocols (OWAMP/TWAMP) [6,7]. As shown in Fig. 1b, these protocols actively test two endpoints by injecting packets with synchronised clocks while measuring the basic characteristics of packet travel, such as round-trip or one-way delay, jitter, packet loss, time-to-live (hops), packet duplication and packet re-ordering. This is more precise than the standard RTT measurement based on ICMP and makes it possible to measure one-direction delays, which is otherwise not possible. TWAMP, which was recently added to the toolkit, has also been implemented by some major network vendors and can thus be used to test against network equipment along the path. An important metric for end-to-end network performance is throughput, which can be measured by three different tools: iperf3, iperf2 and nuttcp. The most common is iperf3, which performs memory-to-memory tests over UDP or TCP and reports TCP retransmits and the size of the congestion window, both of which are very useful in troubleshooting.
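To illustrate the kind of latency metrics mentioned above, the sketch below derives packet loss and jitter from a list of one-way delay samples, in the spirit of what an OWAMP test reports. The function name, field names and the jitter definition (mean absolute difference of consecutive delays) are illustrative choices, not the perfSONAR schema.

```python
# Sketch: OWAMP-style summary statistics from one-way delay samples.
# Assumes at least two packets arrived; field names are hypothetical.

def summarise_owamp(samples, sent):
    """samples: one-way delays in ms for packets that arrived;
    sent: total number of packets injected by the sender."""
    received = len(samples)
    loss = 1.0 - received / sent          # fraction of packets lost
    # Jitter as the mean absolute difference between consecutive
    # delays, a simple proxy for delay variation.
    jitter = sum(abs(b - a) for a, b in zip(samples, samples[1:])) / (received - 1)
    return {"loss": loss, "min_delay_ms": min(samples), "jitter_ms": jitter}

stats = summarise_owamp([10.2, 10.4, 10.3, 12.9], sent=5)
```

A real OWAMP run additionally reports TTL changes, duplication and re-ordering, which require per-packet sequence numbers rather than bare delay values.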
The final part of the network characterisation is the network path, which can be measured by traceroute or tracepath; the latter is preferred because it performs path MTU discovery, determining the maximum transmission unit (MTU) along the path, and so serves as an important indicator of MTU issues, which have become quite common.
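A minimal sketch of how such an MTU issue can be spotted from tracepath output: extract the reported path-MTU values and flag any drop below the first hop's value. The sample transcript is hypothetical; real tracepath formatting varies by version.

```python
import re

def mtu_changes(tracepath_output):
    """Return the sequence of path-MTU values reported along the route."""
    return [int(m) for m in re.findall(r"pmtu (\d+)", tracepath_output)]

# Hypothetical tracepath transcript showing an MTU drop from 9000 to 1500.
sample = """\
 1?: [LOCALHOST]                 pmtu 9000
 1:  gw.example.net        0.5ms
 2:  rtr1.example.net      1.2ms pmtu 1500
 3:  far.example.org      12.7ms reached
"""
mtus = mtu_changes(sample)
if mtus and min(mtus) < mtus[0]:
    print("MTU drops along the path:", mtus)
```

A drop like this (jumbo frames at the source, a 1500-byte segment mid-path) is exactly the pattern behind many of the MTU problems mentioned above.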

WLCG perfSONAR Infrastructure
In order to gather network-specific metrics, the WLCG/OSG perfSONAR deployment was established as a subset of the global perfSONAR deployment dedicated to the needs of WLCG and OSG (see Fig. 2). It has been operated since 2013 and involves most of the WLCG Tier-1 and Tier-2 sites, all OSG sites, testing endpoints at the major R&E network hubs (endpoints on LHCONE within ESnet, GEANT and Internet2 hubs) and associated HEP projects such as Belle II sites. The perfSONAR Toolkit is used to instrument the end-sites with the capability to make a standardised set of network measurements, which can be centrally configured. Each network measurement site usually provides two types of dedicated perfSONAR services: 1) latency and 2) bandwidth. The latency instances measure end-to-end latency and packet loss using OWAMP or TWAMP, while the bandwidth instances measure throughput (typically via iperf3) and also record the network paths using either tracepath or traceroute. WLCG- and OSG-specific documentation on perfSONAR is available covering motivation, deployment options, installation, configuration, use and troubleshooting [8].

Figure 3: OSG Network Monitoring Platform - a distributed deployment that collects, stores, visualises and provides APIs for the measurements collected by the WLCG perfSONAR infrastructure

OSG Network Monitoring Platform
OSG has developed and deployed a comprehensive network monitoring platform [9] that collects, stores, visualises and further processes all the measurements taken by the perfSONAR infrastructure (see Fig. 3). At its core is a collector, which regularly connects to the remote perfSONAR toolkits, downloads all recent measurements and publishes them to a message bus based on RabbitMQ. This stream feeds three different types of stores: a short-term store located at the University of Chicago, which holds the data of the last 6 months; a long-term store located at the University of Nebraska, which holds the entire dataset; and a tape system at FNAL, which is used as a persistent backup. The measurement stream is also available to the experiments via an ActiveMQ bus at CERN, which is populated by a dedicated bridge connected directly to RabbitMQ. The platform is also integrated with the ATLAS Analytics and Machine Learning Platform [10], which makes it easy to combine and analyse network measurements together with metrics from various other sources (including PanDA, FTS, Rucio, etc.).
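The collector's fan-out pattern described above can be sketched as follows: each measurement record taken from the bus is delivered to every configured store. The record fields, endpoint names and store names are illustrative only, not the actual OSG schema or topology.

```python
import json

# Hypothetical stores standing in for the short-term (Chicago),
# long-term (Nebraska) and tape-backup (FNAL) destinations.
STORES = {"short_term": [], "long_term": [], "tape_backup": []}

def publish(record_json):
    """Deliver one measurement record from the message bus to all stores."""
    record = json.loads(record_json)
    for store in STORES.values():      # every store receives every record
        store.append(record)
    return record

msg = json.dumps({
    "src": "ps-bandwidth.site-a.example",   # hypothetical endpoints
    "dst": "ps-bandwidth.site-b.example",
    "type": "throughput",
    "value_mbps": 8450,
})
publish(msg)
```

In the production system the same stream is additionally bridged to ActiveMQ for the experiments; the sketch only shows the one-record-to-all-stores delivery semantics.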
The platform also contains a centralised configuration system (PWA/PSCONFIG), which is used to configure the test specifications (tools/measurement specs), the meshes (collections of hosts participating in the tests) and the test schedule for the entire infrastructure. There is also infrastructure monitoring that oversees the status of the platform and of the measurement infrastructure, and a set of dashboards that visualise the results in various ways [11]. In addition, a number of dashboards are available as part of the CERN monitoring infrastructure [12], based on Grafana, offering different views of the raw data as well as more complex combined views that bring together perfSONAR, network utilisation and data transfers.
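A simplified sketch of what such a centralised mesh configuration looks like is given below. It follows the pSConfig template format only in broad strokes (addresses, a mesh group, a test and a task tying them together); the hostnames are hypothetical and the exact schema should be taken from the perfSONAR documentation rather than from this fragment.

```json
{
  "addresses": {
    "host-a": { "address": "ps-latency.site-a.example" },
    "host-b": { "address": "ps-latency.site-b.example" }
  },
  "groups": {
    "latency-mesh": {
      "type": "mesh",
      "addresses": [ { "name": "host-a" }, { "name": "host-b" } ]
    }
  },
  "tests": {
    "owamp-test": { "type": "latencybg" }
  },
  "tasks": {
    "latency-task": { "group": "latency-mesh", "test": "owamp-test" }
  }
}
```

Because a mesh group pairs every listed address with every other, adding a site to the infrastructure is a one-line change picked up by all participating toolkits.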

Activities and Collaborations
The platform and measurement infrastructure have been used in a number of activities and collaborations, improving our understanding of the networks and contributing to their technical evolution and design.
Establishing end-site network throughput support has helped to resolve a number of challenging cases that would otherwise be very difficult to detect and isolate, or would take a considerable amount of time to resolve [13]. In addition, the unit has helped sites with their data centre network design, consulting on the potential bottlenecks caused by network equipment with insufficient buffers as well as helping to test and benchmark its performance. The feedback gathered by the support unit on the different cases has led to a discussion and a concrete proposal for MTU recommendations for LHCOPN/LHCONE [14], which aims to improve overall throughput and standardise MTU deployment across R&E networks and sites.
There have been a number of significant contributions to the development and design of network performance monitoring over the years; a notable example is the current configuration system, which was initially developed as an internal OSG tool and was later adopted by the perfSONAR consortium. Another area of close collaboration was the deployment and testing of IPv6 readiness, which was led by the HEPiX IPv6 working group [15]. This was a particular example of how the platform can be useful in evaluating the potential deployment of new technologies (such as new TCP congestion control algorithms, software-defined networks, etc.). Another such example is the collaboration with the HELIX NEBULA Science Cloud project [16], which used the platform to assess the network performance of cloud providers. Finally, close collaborations were established with other research domains and institutes that have also shown interest in network performance and in deploying a platform similar to the one deployed for OSG/WLCG.

Network Analytics
Establishing the OSG Network Monitoring Platform and making the data available to experiments and network researchers has triggered great interest from different communities, which have started to look at the existing measurements and performed analyses with various goals. At the same time, the platform has made it possible to diagnose and debug existing network issues, identify problematic links or equipment and help fix the underlying problems. Among the several past and present projects, the following have delivered notable results or identified important areas where further research is needed:
• Real-time detection of "obvious" issues and the corresponding alerting and notifications have been developed at the University of Chicago and are currently being tested as part of the ATLAS Analytics and Machine Learning Platform [10].
• A study on deriving LHCOPN network path performance from the existing OWAMP measurements has shown that OWAMP is sufficiently sensitive to pinpoint when network equipment is stressed and could be used to easily detect peak periods. The main remaining challenge is how to extend the model to LHCONE, mainly due to the lack of reliable network traffic data that could be used to train the neural network [17].
• Combining the existing data from the experiments' data transfers to compute a network cost matrix for all existing links, which could be used to optimise job and data placement and job scheduling. A prototype has been developed and tested, but production deployment requires removing the delay introduced by the collector in order to get close to real-time estimates. This in turn requires the capability to publish results directly from the agents, which is one of the foreseen areas of development.
• Automated debugging of network issues and assistance in finding the root causes in real time was developed and tested as part of the NSF-funded PuNDIT project [18].
• Creating a real-time model of our networks as a graph in a graph database has been prototyped and is one of the potential areas for further research. Finding models that would let us overlay the existing metrics (such as throughput and delay) on an existing graph of network paths (as a time series, since paths can change over time) would significantly help in implementing automated debugging tools or network workflow simulations.
• A network path analysis project is ongoing at the University of Michigan; it aims to calculate simple statistics from the existing path measurements in order to auto-detect potential routing problems and to help with the visualisation of the measurements.
• Understanding the differences between network utilisation as seen by the R&E networks and network utilisation as computed from the experiments' data transfers is another area of interest. While significant effort has gone into understanding network utilisation from the bulk data transfers, there are still major gaps in obtaining reliable sources of information directly from the R&E networks.
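The cost-matrix idea from the list above can be sketched in a few lines: aggregate recent transfer throughput samples per (source, destination) link into a single cost figure that a scheduler could minimise. The aggregation chosen here (inverse of the median throughput, so cheap means fast) and the site names are illustrative, not the prototype's actual algorithm.

```python
from collections import defaultdict
from statistics import median

def cost_matrix(transfers):
    """transfers: iterable of (src, dst, throughput_mbps) samples."""
    samples = defaultdict(list)
    for src, dst, mbps in transfers:
        samples[(src, dst)].append(mbps)
    # Cost = inverse of the median observed throughput per link,
    # so a placement algorithm minimising cost prefers fast links.
    return {link: 1.0 / median(v) for link, v in samples.items()}

costs = cost_matrix([
    ("SITE-A", "SITE-B", 800),    # hypothetical sites and samples
    ("SITE-A", "SITE-B", 1200),
    ("SITE-A", "SITE-C", 400),
])
```

The real-time requirement noted above matters here: a cost matrix built from hours-old collector data can recommend links whose conditions have since changed, which is why direct publishing from the agents is a prerequisite for production use.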
Further analytical studies are planned to better understand our use of the networks and how it could be improved. The new versions of perfSONAR plan to integrate direct publishing of results, together with the configuration needed to operate it globally; this would help us make progress in a number of areas requiring access to real-time data, as well as in providing automated debugging and optimisation.

Future
In summary, OSG in collaboration with WLCG has established a comprehensive network monitoring platform that has been used in a number of activities ranging from operations, support and technology deployment up to research and development in network analytics. We have established and made progress in several areas of network monitoring and plan to continue to evolve them in the near term. There are a number of areas where significant R&D effort will be needed to progress on some of the previously mentioned challenges, but there are also a number of opportunities that could provide funding and effort to continue the work. Two projects that will lead the operations and development of HEP network monitoring are the NSF-funded IRIS-HEP and SAND. IRIS-HEP will fund the LHC part of the Open Science Grid, including the networking area, and will create a new integration path (the Scalable Systems Laboratory) to deliver its R&D activities into the distributed and scientific production infrastructures. Service Analysis and Network Diagnosis (SAND) will focus on combining, visualising and analysing disparate network monitoring and service logging data. It will extend and augment the OSG networking efforts with the primary goal of extracting useful insights and metrics from the wealth of network data being gathered from perfSONAR, FTS, R&E network flows and related network information from HTCondor and others.