Enhancements in Functionality of the Interactive Visual Explorer for ATLAS Computing Metadata

The development of the Interactive Visual Explorer (InVEx), a visual analytics tool for the computing metadata of the ATLAS experiment at the LHC, includes research into various approaches to data handling on both the server and client sides. InVEx is implemented as a web-based application that aims to enhance the analytical and visualization capabilities of the existing monitoring tools and to facilitate data analysis through interactivity and human supervision. The current work focuses on architectural enhancements of the InVEx application. First, we describe the user-manageable data preparation stage for cluster analysis. Then we present the Level-of-Detail approach to interactive visual analysis. It starts at a low level of detail, where all data records are grouped (by clustering algorithms or by categories) and aggregated. Users can then look deeper into the data by incrementally increasing the level of detail. Finally, we demonstrate the development of the data storage backend for InVEx, which is adapted to the Level-of-Detail method and keeps all stages of the data derivation sequence.

Copyright 2020 CERN for the benefit of the ATLAS Collaboration. Reproduction of this article or parts of it is allowed as specified in the CC-BY-4.0 license. © The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/). EPJ Web of Conferences 245, 05032 (2020), CHEP 2019, https://doi.org/10.1051/epjconf/202024505032


Introduction
InVEx (Interactive Visual Explorer) [1] is a toolkit for visual analysis of multidimensional data. This visualization system was developed to meet the need to represent the computing metadata of the ATLAS experiment [2] at the Large Hadron Collider (LHC), and it can address similar tasks in other domains. The InVEx functionality for data analysis, building interactive 3D models, and visualizing clustering results relies on machine learning methods. The development of InVEx started with the implementation of a 3D interactive tool for cluster analysis. Its further evolution was driven by requirements formulated within the community of ATLAS computing experts and aims to ensure the stability and efficiency of the distributed computing environment. While integrating InVEx with ATLAS computing metadata sources we faced two main challenges:
1. Large data volumes need to be analyzed in real time. For example, one ATLAS computing task consists of more than a thousand jobs, and each job is described by two hundred different parameters.
2. Unsupervised clustering algorithms alone are not sufficient for visual cluster analysis. We therefore added user-defined clustering/grouping (by nominal or ordinal parameters), which makes the process of data analysis more manageable.
This paper demonstrates three main enhancements in the InVEx functionality: user-manageable data preparation, Level-of-Detail, and an improved data storage format.

Data Preparation
Data preparation is a prerequisite for real-time data analysis. Data taken from a remote source, i.e., from one of the ATLAS computing monitoring systems, the BigPanDA monitor [3], contain metadata of computing tasks and jobs filtered by particular conditions such as time range, job status, and computing site. Each job may have up to 200 features of various data types. InVEx allows an arbitrary set of features to be used in the data exploration process and therefore may download the entire available set. The InVEx GUI provides a dedicated feature navigation panel, called DatasetInfo, where data descriptors are grouped by type: numerical continuous, ordinal, nominal, range, and non-categorical (string data that cannot be treated as categorical, e.g., error log messages and job names).
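As a rough illustration of how descriptors might be sorted into such groups, the sketch below classifies the columns of a pandas DataFrame. It is not the InVEx implementation: the function name, the `max_nominal_levels` cutoff, and the column names are assumptions made for this example, and the range type is left out because the text does not define it.

```python
import pandas as pd


def group_features_by_type(df, ordinal=(), max_nominal_levels=20):
    """Sort column names into feature groups similar to those in the text.

    `ordinal` lists columns the caller declares as ordered categories.
    `max_nominal_levels` is an arbitrary cutoff (an assumption) that separates
    categorical strings from free text such as error messages or job names.
    """
    groups = {
        "numerical continuous": [],
        "ordinal": [],
        "nominal": [],
        "non-categorical": [],
    }
    for name in df.columns:
        column = df[name]
        if name in ordinal:
            groups["ordinal"].append(name)
        elif pd.api.types.is_numeric_dtype(column):
            groups["numerical continuous"].append(name)
        elif column.nunique() <= max_nominal_levels:
            groups["nominal"].append(name)
        else:
            groups["non-categorical"].append(name)
    return groups


# Hypothetical job records; real ones may carry up to 200 features.
jobs = pd.DataFrame({
    "cpu_time_s": [120.5, 98.0, 340.2, 75.1],
    "priority": [1, 3, 2, 3],
    "site": ["SITE_A", "SITE_B", "SITE_A", "SITE_C"],
    "job_name": ["job-0001", "job-0002", "job-0003", "job-0004"],
})
feature_groups = group_features_by_type(
    jobs, ordinal=["priority"], max_nominal_levels=3
)
```

Declaring ordinal columns explicitly reflects the user-manageable nature of the step: a data type alone cannot tell an ordered category from a plain number.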

Level-of-Detail Method
The data sample for real-time exploration of the PanDA [4] computing payload starts at thousands of jobs and may exceed hundreds of thousands, depending on the filtering conditions and the selected time range. The Level-of-Detail (LoD) Generator [5] was introduced to avoid visualizing such a large number of data objects in one scene. Initially, LoD only clustered the initial data sample using the MiniBatchKMeans algorithm and returned the clusters (i.e., groups) as aggregated data objects. The analyst could thus work with a small number of data objects that reflect the structure of the initial data; a reduced number of objects is also rendered faster and is easier to perceive. The current InVEx release adds a performant implementation of the KMeans algorithm from the Intel Data Analytics Acceleration Library (DAAL) and initial grouping of a large source data sample by known features, which is sometimes more representative for the analyst. LoD supports the following methods: MiniBatchKMeans clustering (python/sklearn), fast KMeans clustering (python/daal4py), K-Prototypes clustering (python/kmodes), grouping by nominal/ordinal parameters (python/pandas), and grouping by numerical continuous parameters (python/pandas).
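The two pandas-based grouping methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the LoD Generator itself: the helper `aggregate_groups`, the column names, the bin edges, and the choice of the mean as the aggregate are all hypothetical.

```python
import pandas as pd


def aggregate_groups(df, key, numeric_cols):
    """Collapse all records that share a group key into one aggregated object."""
    grouped = df.groupby(key, observed=True)
    summary = grouped[numeric_cols].mean()   # one row per group
    summary["n_records"] = grouped.size()    # how many records each row stands for
    return summary


# Hypothetical job records.
jobs = pd.DataFrame({
    "status": ["finished", "failed", "finished", "finished", "failed", "finished"],
    "cpu_time_s": [100.0, 20.0, 300.0, 200.0, 40.0, 1000.0],
    "input_size_mb": [10.0, 5.0, 30.0, 20.0, 15.0, 90.0],
})
numeric = ["cpu_time_s", "input_size_mb"]

# Lowest level of detail: one aggregated object per value of a nominal feature.
by_status = aggregate_groups(jobs, "status", numeric)

# Grouping by a continuous feature: bin it first, then aggregate the same way.
cpu_bin = pd.cut(
    jobs["cpu_time_s"],
    bins=[0, 150, 500, 2000],
    labels=["short", "medium", "long"],
).rename("cpu_bin")
by_cpu_bin = aggregate_groups(jobs, cpu_bin, numeric)

# Increasing the level of detail: open one group and look at its records.
finished_jobs = jobs[jobs["status"] == "finished"]
```

Keeping the record count next to each aggregate lets the analyst see how much data an object represents before opening it.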

Data Exploration Pipeline
The data exploration pipeline in InVEx consists of the following steps: 1) Upload initial data sample from a remote source or the

Data Storage Format
The InVEx storage backend stores the whole data sample under analysis, including every stage of the hierarchical data in the derivation sequence, which starts with the initial data sample as raw input. The entire input is stored persistently because of the nature of exploratory data analysis: the user does not know the structure of the data in advance and may change the selection criteria many times. Storing the entire raw data at the start obviates the need to download the data repeatedly. Along with the data, InVEx also stores the operations performed on them, such as clustering, grouping, and LoD changes. Storage implementations were evaluated against two requirements: a large input data sample (up to millions of records) and an arbitrary number of data analysis pipelines that must also be stored. We considered file storage, a relational database, and a non-relational database. With file storage, the data sample and all stages of the processing pipeline can be stored in a single file or in a folder containing a number of files. Such storage is scalable, but with a plain-text format (e.g., JSON) navigation may become complex as the size grows. A relational database is poorly suited to storing data samples with a variable schema. A non-relational database was not evaluated, because InVEx does not assume common storage for all sessions. As we deal with big data samples, each analytics session must be encapsulated as much as possible. We therefore chose the HDF5 file format as one of the most suitable for InVEx. The HDF5 data model allows the data sample and its metadata to be stored in one file. HDF5 supports compression and is designed for large, heterogeneous, and complex datasets.
An HDF5 file (an object in itself) can be viewed as a container (or group) that holds a variety of heterogeneous data objects, or datasets, such as images, tables, graphs, or even documents (e.g., PDF or Excel files). The two primary objects in the HDF5 data model are groups and datasets. Groups contain other groups (sub-groups) or datasets and thus serve as organizational units. All objects are self-describing, since attributes can be attached to every object, which is one of the advantages of HDF5. The structure of the InVEx storage backend based on HDF5 is shown in Figure 3. "Base_group" is the root object, with datasets describing the input data sample (the "default" dataset holds the raw input data as a NumPy array). The "Modified" sub-group contains modified data samples: data with numeric values only, its normalized version, and data with non-numeric values. Every "modified" group has a set of sub-groups, unlimited in number, that represent the applied operations together with the resulting lists of clustering labels ("operation_X"). When LoD is turned on, a dataset "LoD" holds information about the grouping of the initial records; it is used to create sub-groups "group_N" (of the "modified" group) dynamically from the initial or a previously grouped data sample. Such sub-groups are created on user demand for the analysis of a particular group of records.
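A layout of this kind could be built with h5py roughly as follows. This is a sketch under assumptions rather than the actual InVEx schema: the group names "Base_group", "Modified", "operation_X", "group_N" and the datasets "default" and "LoD" come from the description above, while the exact nesting, the dataset names "numeric", "normalized" and "labels", the "method" attribute, and the file name are hypothetical. The non-numeric dataset is omitted for brevity.

```python
import h5py
import numpy as np

# Hypothetical input: four records with two numeric features each.
raw = np.array([[1.0, 10.0], [2.0, 20.0], [9.0, 80.0], [10.0, 90.0]])
cluster_labels = np.array([0, 0, 1, 1])  # stand-in for a clustering result
lod_groups = np.array([0, 0, 1, 1])      # stand-in for LoD group membership

# The in-memory "core" driver keeps the example from writing to disk.
with h5py.File("invex_session.h5", "w", driver="core", backing_store=False) as h5:
    base = h5.create_group("Base_group")
    base.create_dataset("default", data=raw)  # raw input sample

    modified = base.create_group("Modified")
    modified.create_dataset("numeric", data=raw)
    # Min-max normalization; assumes no feature is constant.
    low, high = raw.min(axis=0), raw.max(axis=0)
    modified.create_dataset("normalized", data=(raw - low) / (high - low))

    # One sub-group per applied operation, described by attributes.
    operation = modified.create_group("operation_0")
    operation.attrs["method"] = "MiniBatchKMeans"
    operation.create_dataset("labels", data=cluster_labels)

    # LoD membership and one group opened on user demand.
    modified.create_dataset("LoD", data=lod_groups)
    group = modified.create_group("group_1")
    group.create_dataset("numeric", data=raw[lod_groups == 1])

    # Read back: list every stored object and load the opened group.
    stored_paths = []
    h5.visit(stored_paths.append)
    group_1_rows = group["numeric"][...]
    method = operation.attrs["method"]
```

Because each operation and each opened group lives under its own path, a whole analysis session stays in one self-describing file.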

Conclusion
The InVEx toolkit can be compared with visual analytics tools such as Clustervision [6], Clustrophile [7], and XCluSim [8]. However, these tools provide neither 3D visualization, which may be useful for exploring spatial objects, nor a flexible Level-of-Detail mechanism, which is crucial for analyzing large volumes of data. Notably, the Level-of-Detail component in InVEx uses the Intel DAAL library, which significantly increases the performance of K-means clustering compared with the python/sklearn library. InVEx was developed and tested on the computing metadata of the ATLAS experiment at the LHC, and it is applicable to the exploration and monitoring of various data. The next stage of development is to add supercomputing resources for data preparation and processing for visual analysis.