The ATLAS Wide-Range Database and Application Monitoring

In HEP experiments at LHC the database applications often become complex, reflecting the increasingly demanding requirements of the researchers. The ATLAS experiment has several Oracle DB clusters with over 216 database schemes each with its own set of database objects. To effectively monitor them, we designed a modern and portable application with exceptionally good characteristics. Some of them include: A concise view of the most important DB metrics; a list of top SQL statements based on CPU, executions, block reads, etc.; volume growth plots per schema and DB object type; a database jobs section with signalization for failures; and in-depth analysis in case of row-lock contention or DB sessions. This contribution also describes the technical aspects of the implementation. The project can be separated into three independent layers. The first layer consists in highly-optimized database objects hiding all complicated calculations. The second layer represents a server providing REST access to the underlying database backend. The third layer is a JavaScript/AngularJS web interface. In addition, we will summarize the continuous integration cycle of the application, which uses GitLab-ci pipelines for basic testing, containerization and deployment on the CERN Openshift infrastructure.


Introduction
In the ATLAS experiment [1] we have a wide spectrum of database applications dealing with data processing and bookkeeping as well as data, such as event meta-data, detector calibration constants, and detector geometry description. The data are hosted in several Oracle databases [2]. We can identify three production databases with different purposes: one related to the detector data (accessible from the network in the ATLAS experimental area), one for the offline related data and one for distributed computing data. In addition, there are two full copies of the online and distributed databases and two test databases dedicated to applications developments.
Over 100 unique applications exist in the ATLAS databases. Some of those are spread among several database schemas, others own thousands of database objects and some execute hundreds of parallel processes per second. Specific applications such as PanDA[4], Rucio [3], WinCC [5], are of significant importance for the scientific community and any disturbance in serving their DB workload has to be avoided. Moreover, early detection of potential database issues might be important for smooth operation of the databases.
In order to perform monitoring and provide efficient support for the database schemas, a dedicated Web application was developed. The ATLAS Database & Application Monitoring (DBMon) [6] is a custom Oracle DB monitoring tool created for the needs of the ATLAS application developers and the Oracle database administrators. Its goal are to: • present an overview of the current state of the database clusters • provide detailed information about a selected database cluster • produce reports based on specific database schema name

Architecture
The project has three independent layers, shown in Figure 1 as a Database component on the left; a Java Application component in the centre; and a web application component shown on the right. The first layer consists of highly-optimized database objects fed by sophisticated calculations in PL/SQL functions and procedures. The second layer represents a Java application providing REST access to the underlying database back-end. The third layer is a JavaScript/AngularJS web interface visualizing the data.

Database setup
The DBMon application resides at one of the ATLAS databases (the offline cluster), where a couple of scheduler jobs run regularly to collect information for the current database as well as the additional clusters. The jobs call various PL/SQL [7] procedures, which read data from multiple DB view objects (such as v$sql, v$sessions, v$dba_objects, etc.). Then the data are inserted into a dedicated set of tables.
The information is primarily statistical, such as the number of sessions and their distribution across the database instances; details about running/failed/succeeded jobs; a full log of currently running queries and number of executions; a list of used CPU resources; etc. The automation of the process boosts the administrators productivity and improves their efficiency. In addition, it provides valuable information to developers who work on their application database layer.

Java/REST application
The REST application is based on Spring [8] and Jersey [9], providing a number of HTTP endpoints for feeding the Web application with data. The code is minimalistic. All query calls are wrapped into PL/SQL functions, which allows easy changes when necessary. For example: Behind get_blocking_sessions function there is a complex hierarchical query which uses dynamic parameters. The result is sent in the form of a simple table, whereas the logic remains hidden. Moreover, refining the query does not require a person to become familiar with the Java [10] project.

Web application
The front-end of the application is built with the AngularJS [11] framework. The technology is JavaScript-based and provides Model-View-Controller [12] design, which makes it suitable for implementing the database monitoring.
The concept behind AngularJS gives us tools for creating a well structured application, works well with the popular JavasScript[13] libraries and allows two-way data binding between the Model and the View. The "scope" is one of the building blocks of AngularJS. It forms the Model and allows the View and the Controller to access all variables in their specific scope.
Since the monitoring application is developed with the idea of being constantly extendable, we took advantage of another AngularJS feature -the directives. They let us create reusable components that could be called with a custom HTML tag and different parameters. For example, a bar-chart can be rendered by adding the following line: For the creation of interactive charts we use Highcharts [14]. The monitoring is dependant also on Bootstrap [15], Angular Material [16] and TreeGrid [17]. All of the libraries are open source and actively supported.

Features
The presented application is a tool that allows developers to explore and tune the application's database performance. As such, it provides various data representations through graphical components, which also help in detecting database issues and abnormal application behaviour. The key features are listed below: • Historical plots of database activity -DBMon provides a set of 10 plots describing the state of the different database instances over specified period. This helps in identification of problematic database or application behaviour by narrowing down to start, end and DB instance.
• Top resource consumers (Fig. 2) -When a database experiences high load, it is crucial to efficiently understand the root cause for that. DBMon machinery captures all DB requests (among thousands per second), having longer than 3 sec execution time and stores their most important metrics. Then the SQL statements are ranked and the top 10 of them are displayed in charts as seen bellow. • DB volume growth per schema (Fig. 3) -ATLAS databases host several applications which data volume is growing fast and needs to be closely monitored. Furthermore, developers benefit from the fact that DBMon provides a way of plotting the size of their DB schemes.
• Tracking the number of active/inactive sessions per DB user (Fig. 4) -A common reason for increased database activity is the high number of active sessions for a specific application. Therefore, DBMon gives the number of active/inactive sessions spread among the DB cluster machines. Moreover, the monitoring provides additional in-depth details per session (e.g. SQL ID, service name, reason for waiting, etc.).  • Database jobs' state -Although, database jobs are not widely used by the ATLAS applications, there is number of those that run regularly on the ATLAS databases. Some of them gather statistics, others make calculations or feed other tables. It is important to have a tool that observes the state of the Oracle jobs and notify administrators when there are failures.
• Blocking/blocked sessions and potential reasons for the blocking (Fig. 5) -The ATLAS production databases serve transactional workload by hundreds of concurrent processes. In certain circumstances that might lead to row(s) lock contention or high concurrency on table or index blocks ("hot blocks"). In those cases, DBMon provides valuable information by displaying an hierarchical tree which shows unambiguously the blocker session as well as details about it. This is one of the ATLAS DBMon features that excels compared to the other monitoring tools. Moreover, blocking/blocked sessions information is stored historically for subsequent analysis. • SQL statement heat map (Fig. 6) -Whenever a SQL statement uses bind variables it keeps the same ID for all executions. Thus, for each SQL ID Oracle stores performance statistics in the Automatic Workload Repository (AWR). Our monitoring provides a heat map that highlights the values increased with more than 20% in comparison with the preceding AWR statistics period. • SQL execution plan(s) -The Oracle CBO (Cost-Based Optimizer) might decide to change the execution plan of any query in the database based on consideration of various parameters and data distribution within the tables. DBMon retrieves a list of current and previous execution plan(s), which is useful for understanding why and when a SQL execution plan had been changed.
• Storing monitoring data -All monitoring data is stored historically, which is extremely beneficial in situations when DB issues occurred outside the working hours.

Deployment
ATLAS DBMon uses GitLab CI (Continuous Integration) [18] pipelines for basic testing, containerization and deployment on the CERN Openshift [19] infrastructure. A Docker [20] image of the application is created during the process and deployed via the OpenShift client whenever there is a commit in the Master on GitLab. The Web-page is protected via CERN SSO (Single Sign-on) and restricted to be accessible only by ATLAS members.