Machine Learning-based Anomaly Detection of Ganglia Monitoring Data in HEP Data Center

Abstract. This paper introduces a generic and scalable anomaly detection framework. Anomaly detection can improve operation and maintenance efficiency and ensure that experiments can be carried out effectively. The framework facilitates common tasks such as data sample building, retagging and visualization, deviation measurement and performance measurement for machine learning-based anomaly detection methods. The samples we use are sourced from Ganglia monitoring data. The framework includes several anomaly detection methods that handle spatial and temporal anomalies. Finally, we show a preliminary application of the framework to Lustre distributed file systems in daily operation and maintenance.


Introduction
At present, the Institute of High Energy Physics (IHEP) local cluster consists of 20,000 CPU slots, hundreds of data servers, 20 PB of disk storage and 10 PB of tape storage. Once the Jiangmen Underground Neutrino Observatory (JUNO) and the Large High Altitude Air Shower Observatory (LHAASO) [1] experiments are taking data, the data volume processed at this center will approach 10 PB per year. Prompt anomaly detection can improve operation and maintenance efficiency and ensure that high energy physics experiments can be carried out effectively. To this end, we developed a generic anomaly detection framework based on machine learning.
Anomalies are data points that differ either from the majority of other points or from the value expected by a reliable prediction model of a time series. For Ganglia monitoring metric data, we classify anomalies into spatial anomalies and temporal anomalies. Spatial anomalies are outlying points in high-dimensional data considered without the time dimension. Temporal anomalies are not necessarily spatial anomalies, but an analysis of their temporal characteristics shows that they differ markedly from the rest of the current sequence.
We group machine learning-based methods into four broad categories. The first is based on classification [2]: anomalies are detected by models trained with supervised learning algorithms such as XGBoost and random forest. The second is based on cluster analysis [3,4]: samples are grouped by density analysis or partitioning with algorithms such as k-means and Isolation Forest [5]. The third and fourth categories first predict a value and then detect anomalies from the difference between the real value and the predicted value. The prediction methods of the third category are based on statistics [6], such as the Autoregressive Integrated Moving Average model (ARIMA) [7]. The prediction methods of the fourth category are based on deep learning [8], such as Hierarchical Temporal Memory (HTM) [9,10].
We developed a framework because a particular anomaly detection algorithm is usually applicable only to a specific use case. Unlike the methods mentioned above, our framework can detect anomalies by combining the relationships among multiple indicators in addition to using a single metric. The framework is written in Python and can be extended with statistical machine learning algorithms and deep learning algorithms. It provides functions such as data sample building, retagging and visualization, deviation measurement and performance measurement for machine learning-based anomaly detection methods.

Architecture
As shown in Figure 1, after collecting data from the Ganglia monitoring system and preprocessing it, we detect spatial anomalies and temporal anomalies separately. In addition, the anomaly filter modules ignore irrelevant anomalies caused by known circumstances such as system upgrades; these are specified by time intervals and node names. The framework is built on Django [11], which follows an MTV (model-template-view) architecture. The monitoring data collected by the Ganglia monitoring system at IHEP comprise over twenty general metrics and about one hundred special metrics. The timestamped monitoring data are stored in Elasticsearch [12]. In the data interface layer, we provide a CRUD (create, retrieve, update, and delete) interface built on the Python Elasticsearch client. Configuration such as the list of metrics and model information is stored in MySQL, which we access through the Django ORM (Object Relational Mapping). The template layer is a collection of HTML pages that correspond to the common functions of anomaly detection tasks. The visualization presents prediction and detection results as line diagrams and scatterplots.
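The anomaly filter described above can be illustrated with a minimal sketch. The `MaintenanceWindow` type, the `filter_anomalies` function and the node names are hypothetical and are not taken from the framework; the sketch only assumes that each detected anomaly carries a node name and a timestamp.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    """A known event (e.g. a system upgrade) on one node; hypothetical type."""
    nodename: str
    start: datetime
    end: datetime


def filter_anomalies(anomalies, windows):
    """Drop anomalies that fall inside a known window for the same node.

    `anomalies` is an iterable of (nodename, timestamp) pairs.
    """
    kept = []
    for nodename, ts in anomalies:
        ignored = any(
            w.nodename == nodename and w.start <= ts <= w.end for w in windows
        )
        if not ignored:
            kept.append((nodename, ts))
    return kept


# Illustrative data only: node names and dates are made up.
windows = [
    MaintenanceWindow("mds01", datetime(2019, 10, 3, 8), datetime(2019, 10, 3, 12))
]
anomalies = [
    ("mds01", datetime(2019, 10, 3, 9, 30)),  # inside the upgrade window
    ("mds01", datetime(2019, 10, 4, 9, 30)),  # same node, outside the window
    ("mds02", datetime(2019, 10, 3, 9, 30)),  # different node
]
remaining = filter_anomalies(anomalies, windows)
```

Only the first anomaly is dropped, because it matches both the node name and the time interval of the known event.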

Algorithms
We have two classes of anomaly detection methods for temporal anomalies. The first detects anomalies after extracting time-series features. The second detects anomalies from the deviation between the predicted data and the true data.

Time-series feature extraction
We keep the data of the last W time steps as the time series x, so there are W data points per metric in the time series. We extract statistical features and fitting features for every time series.
Statistical features consist of general statistical features, skewness, kurtosis, a volatility indicator and statistical features about repeating data. General statistical features consist of the maximum value, minimum value, mean, variance, standard deviation, median, dot product, sum, range, and the locations and relative locations of the maximum and minimum. The volatility indicator measures the volatility of the data by computing the mean, the mean of absolute value and the sum of the first difference, and by counting the number of values in x that are lower or higher than the mean of x. Statistical features about repeating data consist of the percentage of recurring values and some general statistical features computed after removing duplicate values.
We take the last k (k = 6, 12, 18, 24, 30, 36) data points of time series x to form a collection of sub-series Y, one sub-series per value of k. The fitting features are the differences between the last element of x and the smoothed value of each sub-series in Y under the Moving Average [13], Weighted Moving Average [14], Exponential Moving Average [15] and Double Exponential Moving Average [16] algorithms. Time-series feature extraction yields over 30 statistical features. The number of fitting features depends on W.
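A subset of these statistical features can be sketched with the Python standard library. This is an illustration, not the framework's code: the function name and dictionary keys are made up, the volatility terms are interpreted here as statistics of the first difference, and the percentage of recurring values is interpreted as the fraction of points whose value occurs more than once. Both interpretations are assumptions.

```python
import statistics
from collections import Counter


def statistical_features(x):
    """Compute a subset of the statistical features for one window x (len >= 2)."""
    n = len(x)
    mean = statistics.fmean(x)
    variance = statistics.pvariance(x)  # population variance
    std = variance ** 0.5
    hi, lo = max(x), min(x)
    i_max, i_min = x.index(hi), x.index(lo)  # first occurrence of each extreme

    # Standardized third and fourth moments; set to 0 for constant data.
    if std > 0:
        skewness = sum((v - mean) ** 3 for v in x) / (n * std ** 3)
        kurtosis = sum((v - mean) ** 4 for v in x) / (n * std ** 4)
    else:
        skewness = kurtosis = 0.0

    # Volatility terms, assumed here to refer to the first difference of x.
    diffs = [b - a for a, b in zip(x, x[1:])]
    counts = Counter(x)
    recurring_points = sum(c for c in counts.values() if c > 1)

    return {
        "max": hi, "min": lo, "mean": mean, "variance": variance, "std": std,
        "median": statistics.median(x), "sum": sum(x), "range": hi - lo,
        "dot_product": sum(v * v for v in x),
        "argmax": i_max, "argmin": i_min,
        "rel_argmax": i_max / n, "rel_argmin": i_min / n,
        "skewness": skewness, "kurtosis": kurtosis,
        "mean_diff": statistics.fmean(diffs),
        "mean_abs_diff": statistics.fmean(abs(d) for d in diffs),
        "sum_abs_diff": sum(abs(d) for d in diffs),
        "count_above_mean": sum(1 for v in x if v > mean),
        "count_below_mean": sum(1 for v in x if v < mean),
        "recurring_ratio": recurring_points / n,
    }


features = statistical_features([1.0, 2.0, 2.0, 5.0])
```

Each window of W points is reduced to one flat feature vector, which is the form that a detector such as Isolation Forest expects.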
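The fitting features can be sketched as follows. The smoothing factor `alpha`, the linear weights of the weighted moving average, the initialization of the exponential average with the first value, and all function names are assumptions made for illustration; the text does not specify them.

```python
def moving_average(y):
    """Plain mean of the sub-series."""
    return sum(y) / len(y)


def weighted_moving_average(y):
    """Linearly increasing weights 1..k, so recent points count more (assumed)."""
    weights = range(1, len(y) + 1)
    return sum(w * v for w, v in zip(weights, y)) / sum(weights)


def ema_series(y, alpha):
    """Exponential smoothing, initialized with the first value (assumed)."""
    smoothed = [y[0]]
    for v in y[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


def exponential_moving_average(y, alpha):
    return ema_series(y, alpha)[-1]


def double_exponential_moving_average(y, alpha):
    """DEMA = 2 * EMA - EMA(EMA), evaluated at the last point."""
    ema = ema_series(y, alpha)
    return 2 * ema[-1] - ema_series(ema, alpha)[-1]


def fitting_features(x, ks=(6, 12, 18, 24, 30, 36), alpha=0.5):
    """Difference between the last element of x and each smoothed value."""
    last = x[-1]
    features = {}
    for k in ks:
        if k > len(x):  # a sub-series longer than the window is not available
            continue
        y = x[-k:]
        features[f"ma_{k}"] = last - moving_average(y)
        features[f"wma_{k}"] = last - weighted_moving_average(y)
        features[f"ema_{k}"] = last - exponential_moving_average(y, alpha)
        features[f"dema_{k}"] = last - double_exponential_moving_average(y, alpha)
    return features


ramp = [float(i) for i in range(40)]
ramp_features = fitting_features(ramp)
```

In this sketch a window with W = 24 yields 16 fitting features and a window with W >= 36 yields 24, which is one way the feature count can depend on W.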

Prediction algorithms
Statistical approaches consist of Moving Average (MA), Exponentially Weighted Moving Average (EWMA) and linear regression (LR). We predict the current data by fitting and smoothing historical data with these algorithms.
Long short-term memory (LSTM) [17] is well suited to prediction on time series data. At each step, the output of the previous step is combined with the next input to compute the next output. We select some of the historical metric data to predict the current metric data. The number of metrics of the current data is m and the number of metrics of the selected historical data is n. For each sample, the input has shape (length of window, n) and the output has shape (1, m). The main configuration parameters include the number of LSTM layers, the number of units (hidden neurons), the number of epochs, the batch size and the fraction of data reserved for validation.
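The three statistical predictors can be sketched as one-step-ahead forecasts. The function names, the window length, the smoothing factor `alpha` and the use of the time index as the regression variable are assumptions for illustration.

```python
def predict_ma(history, window):
    """Predict the next value as the mean of the last `window` points."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def predict_ewma(history, alpha):
    """Predict the next value as the exponentially weighted average so far."""
    smoothed = history[0]
    for v in history[1:]:
        smoothed = alpha * v + (1 - alpha) * smoothed
    return smoothed


def predict_lr(history):
    """Fit y = a + b * t by least squares on t = 0..n-1 and extrapolate to t = n.

    Requires at least two points.
    """
    n = len(history)
    t_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept + slope * n


history = [1.0, 2.0, 3.0, 4.0]
ma_pred = predict_ma(history, window=2)
ewma_pred = predict_ewma(history, alpha=0.5)
lr_pred = predict_lr(history)
```

On a steadily rising series the two averages lag behind the trend while the regression extrapolates it, which is why the best predictor can differ from metric to metric.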
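The sample shapes described above can be illustrated with a plain-Python windowing sketch. It only builds the (length of window, n) inputs and (1, m) outputs and does not train a network; the function name and the choice that the window ends just before the predicted time step are assumptions.

```python
def build_samples(inputs, targets, window):
    """Build sliding-window samples for sequence prediction.

    inputs:  list of T rows, each with n selected historical metrics
    targets: list of T rows, each with m metrics to predict
    Returns X with shape (samples, window, n) and Y with shape (samples, 1, m).
    """
    X, Y = [], []
    for t in range(window, len(inputs)):
        X.append(inputs[t - window:t])  # the `window` steps before time t
        Y.append([targets[t]])          # the metrics at time t
    return X, Y


# Illustrative data: T = 10 time steps, n = 3 input metrics, m = 2 targets.
T, n, m, window = 10, 3, 2, 4
inputs = [[float(t + j) for j in range(n)] for t in range(T)]
targets = [[float(t * 10 + j) for j in range(m)] for t in range(T)]
X, Y = build_samples(inputs, targets, window)
```

Nested lists of this shape can be converted to arrays and passed to a deep learning library of choice.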
To evaluate the performance, we provide error indicators of different models when fitting the time-series with different metric data. In the framework, the error indicators consist of Mean Error (ME), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Percentage Error (MPE) and Mean Absolute Percentage Error (MAPE).
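The five error indicators can be sketched as below. The sign convention (error = true value minus predicted value) and the expression of MPE and MAPE in percent are assumptions; the percentage errors also assume that no true value is zero.

```python
import math


def error_indicators(actual, predicted):
    """Return ME, MAE, RMSE, MPE and MAPE for two equal-length sequences."""
    errors = [a - p for a, p in zip(actual, predicted)]
    ratios = [e / a for e, a in zip(errors, actual)]  # assumes a != 0
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MPE": 100.0 * sum(ratios) / n,
        "MAPE": 100.0 * sum(abs(r) for r in ratios) / n,
    }


indicators = error_indicators(actual=[100.0, 200.0], predicted=[110.0, 190.0])
```

ME and MPE keep the sign of the errors and can cancel to zero for a biased-then-corrected forecast, whereas MAE, RMSE and MAPE measure magnitude only.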

Detection algorithms
The algorithms described in Sections 3.3.1 and 3.3.2 are used after prediction. The Isolation Forest described in Section 3.3.3 is used after time-series feature extraction or for spatial anomalies.

N-sigma
The predicted value at time t is y_t and the true value is x_t. Because some metrics are more volatile than others, we use the relative error defined in Equation 1.
We could set a threshold directly, but it would then have to be changed manually for different situations. We drew violin plots of the relative errors of three metrics (bytes_in, cpu_idle and mem_free) on two metadata servers, and the data are approximately normally distributed. We therefore assume that the relative error within a time window follows a normal distribution and set the threshold from a confidence probability.
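A minimal sketch of the N-sigma rule follows. The form of the relative error, (x_t - y_t) / x_t, is an assumption, as are the function names and the default confidence of 0.9973 (roughly three sigma, two-sided). The sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist, fmean, pstdev


def relative_error(x_t, y_t):
    """Assumed form of the relative error; requires a nonzero true value x_t."""
    return (x_t - y_t) / x_t


def n_sigma_bounds(errors, confidence=0.9973):
    """Two-sided bounds mu +/- n * sigma for a given confidence probability."""
    # Convert the confidence probability into a number of standard deviations.
    n = NormalDist().inv_cdf(0.5 + confidence / 2)
    mu = fmean(errors)
    sigma = pstdev(errors)
    return mu - n * sigma, mu + n * sigma


def is_anomaly(error, bounds):
    low, high = bounds
    return error < low or error > high


# Illustrative relative errors from one time window.
window_errors = [-0.02, -0.01, 0.0, 0.01, 0.02]
bounds = n_sigma_bounds(window_errors)
flag_large = is_anomaly(relative_error(100.0, 50.0), bounds)
flag_small = is_anomaly(relative_error(100.0, 99.0), bounds)
```

Because the bounds are recomputed from each window, the threshold adapts to the volatility of the metric instead of being tuned by hand.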

Q-function
We compute the anomaly score s_t from a Gaussian tail probability (the Q-function) of a distribution with mean µ. The closer the anomaly score of a sample is to 1, the more likely the sample is an anomaly. The anomaly score ranges from 0.5 to 1.
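One form of the score that is consistent with the stated range of 0.5 to 1 is s_t = 1 - Q(|v_t - µ| / σ), where Q is the Gaussian tail probability. This exact form, the symbol v_t for the scored value, the standard deviation σ and the function names are assumptions for illustration and may differ from the framework's definition.

```python
import math


def q_function(z):
    """Gaussian tail probability Q(z) = P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def anomaly_score(value, mu, sigma):
    """Assumed score: 1 - Q(|value - mu| / sigma), which lies in [0.5, 1)."""
    z = abs(value - mu) / sigma
    return 1.0 - q_function(z)


score_at_mean = anomaly_score(0.0, mu=0.0, sigma=1.0)
score_one_sigma = anomaly_score(1.0, mu=0.0, sigma=1.0)
score_far = anomaly_score(10.0, mu=0.0, sigma=1.0)
```

A value at the mean scores exactly 0.5, and the score approaches 1 as the value moves into the tail, so a threshold close to 1 flags only extreme deviations.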

Isolation Forest
For Isolation Forest [18], we first draw a random subsample and then build many binary trees by random cuts. While building a tree, we split on randomly chosen features until each data tuple forms a leaf node or the tree height reaches its limit. Anomalies are isolated into leaf nodes more easily, so their average path length is shorter than that of other points. The number of samples used for building the trees is n, and c(n) is the average path length of the trees, which is used for normalization. x is an instance and h(x) is the number of edges between the root node and the terminating node plus an adjustment c(T.size), where T.size is the number of instances in the terminating node where x is located. We compute the anomaly score S(x, n) from the average path length E(h(x)):
$$S(x, n) = 2^{-E(h(x)) / c(n)}, \qquad c(n) = 2H(n - 1) - \frac{2(n - 1)}{n}, \qquad H(n) = \ln(n) + 0.5772156649$$
For temporal anomaly detection, the features are time-series features. For spatial anomaly detection, the features are the metrics of the samples. The main parameters include the number of trees (n_estimators), the number of subsamples (max_samples), the anomaly ratio (contamination) and whether to extract time-series features.
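The scoring step can be sketched from the standard Isolation Forest definitions. The sketch covers only the score computation from given path lengths, not tree construction; the function names are made up, and the special cases c(2) = 1 and c(n) = 0 for n <= 1 follow common practice and are stated here as assumptions.

```python
import math

EULER_GAMMA = 0.5772156649


def harmonic(i):
    """Approximate harmonic number H(i) = ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA


def c(n):
    """Average path length used to normalize h(x) for n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def path_length(edges, leaf_size):
    """h(x): edges from root to the terminating node plus c(T.size)."""
    return edges + c(leaf_size)


def anomaly_score(path_lengths, n):
    """S(x, n) = 2 ** (-E(h(x)) / c(n)), averaged over all trees."""
    mean_h = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_h / c(n))


n_samples = 256
short_paths = [path_length(3, 1), path_length(4, 1)]    # isolated quickly
long_paths = [path_length(11, 2), path_length(12, 3)]   # hard to isolate
score_anomalous = anomaly_score(short_paths, n_samples)
score_normal = anomaly_score(long_paths, n_samples)
```

An instance whose average path length equals c(n) scores exactly 0.5; shorter paths push the score toward 1 and longer paths push it toward 0.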

Data
Real monitoring data from the metadata servers of the Lustre file system [19] in the IHEP data center are used for the experiments. The metrics used are shown in Table 1. For models that require a large amount of historical data for training, we use data from 8 metadata servers collected from August 1st, 2019 to September 30th, 2019. The test dataset comes from one of the metadata servers and covers October 1st, 2019 to October 25th, 2019. There are almost 140,000 training samples and almost 6,000 test samples. The data in the database carry no anomaly labels, so we cannot yet compute precision and recall accurately.

Prediction Experiments
We compare these algorithms (parameter configurations are shown in Table 2) using RMSE and MAPE. As shown in Figure 2, LSTM has the lower deviation for the network and CPU metrics, while EWMA performs better for memory. However, LSTM can predict all metrics with a single model.

Anomaly Detection Experiments
We use three anomaly detection methods. For temporal anomaly method 1, we use Isolation Forest (n_estimators=100, max_samples=256, contamination=0.0001) after extracting time-series features. For temporal anomaly method 2, the prediction algorithm is LSTM (parameter configuration as listed in Table 2) and the detection algorithm is the Q-function (s_t = 0.9999). For the spatial anomaly method, we use Isolation Forest (n_estimators=100, max_samples=256, contamination=0.0001). Detection results are shown in Figures 3-5.

Conclusion
In this paper, we introduced a generic anomaly detection framework that provides the functionality required for anomaly detection tasks, such as data sample building, retagging and visualization, deviation measurement and performance measurement. It was initially applied to the metadata servers of the Lustre file system, but it is not yet in production. Furthermore, the framework provides time-series feature extraction, several prediction models and several anomaly detection algorithms. In the future, we will associate the detected anomalies with anomalies in the actual environment, such as disk failures and abnormal traffic.