Lessons Learned from the Assessment of Software Defect Prediction on WLCG Software: A Study with Unlabelled Datasets and Machine Learning Techniques

Software defect prediction is an activity that aims at narrowing down the most likely defect-prone software modules and at helping developers and testers to prioritize inspection and testing. This activity can be addressed by applying Machine Learning techniques to software metrics datasets that are usually unlabelled, i.e. they lack the classification of modules in terms of defectiveness. To overcome this limitation, in addition to the usual data pre-processing operations to manage missing values and/or to remove inconsistencies, researchers have to adopt an approach to label their unlabelled software datasets. The extraction of defectiveness data to label all the instances of a dataset is an extremely time- and effort-consuming operation. In the literature, many studies have introduced approaches to build defect prediction models on unlabelled datasets. In this paper, we describe the analysis of new unlabelled datasets from WLCG software, coming from HEP-related experiments and middleware, by using Machine Learning techniques. We have experimented with new approaches to label the various modules, owing to the heterogeneity of the software metrics distributions. We discuss a number of lessons learned from conducting these activities, what has worked, what has not, and how our research can be improved.


Background
Machine learning (ML) as a means to support different Software Engineering (SE) tasks, such as software defect prediction and test code generation, has often been considered in research studies in the last decades [1][2][3][4][5]. ML techniques are fed with input software data properly processed and collected in datasets that are composed of instances, i.e. software modules (such as files, classes and functions), and features, i.e. software metrics [6]. For software defect prediction, supervised ML techniques also require the actual defect information of each instance; nevertheless, this information may not be available in the software archives of new or recent software projects, or it may not have been traced properly in already existing ones [7]. This constitutes a serious limitation to the utilization of supervised ML techniques [8].
To address the limitation of supervised learning techniques in constructing defect prediction models from unlabelled datasets, researchers have proposed various approaches, which can be categorized into five groups.
1. The within-project defect prediction (WPDP) is a typical prediction process built for a specific project and is based on supervised ML. This approach is characterized by high precision; however, since the model is built on a single software project, it can hardly be used to predict defects in other projects [9].
2. The cross-project defect prediction (CPDP) builds a prediction model using a labelled dataset from one project. Then, it uses the same model to predict whether an instance of another software project is defective or not. This approach can be useful for new projects or projects with limited defect information; however, it assumes that the two datasets share the same set of metrics and have the same probability distribution [10].
3. The expert-based defect prediction first employs a clustering algorithm, like K-means, to cluster the unlabelled instances, then it relies on a human expert to label each cluster as defective or not [11]. The major limitation of this approach is that it requires human experts to categorize each cluster as defective or not.
4. The threshold-based defect prediction approach predicts an instance as buggy when any metric value is greater than that metric's threshold. This approach can be automated; however, the defectiveness prediction depends on metric thresholds that must be established in advance [12] (a minimal sketch of this idea is given after this list).
5. The Clustering, LAbelling, Metric selection, Instance selection (CLAMI) approach is based on a four-step procedure applied to the instances of an unlabelled dataset. It is an automatable approach that does not involve human effort; it relies on metrics' values, which may not always be comparable and may introduce bias, and it depends on metric thresholds [13]. CLAMI+ is an evolution of the CLAMI approach: it employs a different procedure in the metric selection phase. CLAMI+ is still dependent on thresholds, but it normalizes metrics' values [14].
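As an illustration of the threshold-based approach described in item 4, the following Python sketch labels a module as defect-prone when any metric exceeds a per-metric threshold. The metric names and threshold values are illustrative placeholders, not values taken from this study or from [12].

```python
# Minimal sketch of threshold-based labelling (item 4 above).
# Metric names and thresholds are illustrative placeholders.
THRESHOLDS = {
    "cyclomatic_complexity": 10,
    "lines_of_code": 200,
    "fan_out": 7,
}

def label_instance(metrics: dict) -> str:
    """Label a module as defect-prone if any metric exceeds its threshold."""
    for name, threshold in THRESHOLDS.items():
        if metrics.get(name, 0) > threshold:
            return "defective"
    return "non-defective"

# Hypothetical module: complexity above its threshold, hence labelled defective.
print(label_instance({"cyclomatic_complexity": 14, "lines_of_code": 120, "fan_out": 3}))
```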
The extraction of the complete set of features (metrics and labels) is time- and effort-consuming; moreover, the selection of the right tool for metrics' extraction can be difficult. For these reasons, unlabelled datasets constitute the vast majority of software datasets. To perform defect prediction with unlabelled datasets, it is necessary to find an automatable way to label instances.

Experimental Setup
The Worldwide LHC Computing Grid (WLCG) [15] employs a wide variety of software, the vast majority of which has adopted DevOps procedures in its development and maintenance phases. The adopted tools collect software metrics, e.g. cyclomatic complexity or lines of code [16,17], over the releases, which can be used to build software datasets. In our study, we have found that software projects have documentation related to code changes, such as release notes, which can be exploited to assess the defectiveness prediction.
Our work aims at testing the usefulness of ML techniques in the WLCG domain for identifying the pieces of code that require particular attention during the development and maintenance phases of software. ML techniques can help in selecting software modules that should be examined with greater care by developers. To achieve this goal, we have constructed a defect prediction model by exploiting the unlabelled software datasets of Geant4, one of the most rigorously validated software packages for the simulation of the passage of particles through matter [18]. Among the different ML methodologies, we have selected CLAMI [13] and CLAMI+ [14] to label the instances in the software datasets. In addition, we have applied a large set of ML techniques to predict defect-prone modules.
Our approach (summarized in Figure 1) uses as input a subset of the Geant4 software dataset composed of 482 modules with 66 software metrics and 34 releases. This dataset has been obtained by applying a static analysis tool (the Imagix 4D tool [19]) to the various modules of several software releases [20]. The Imagix 4D output has been preprocessed in order to keep only the software modules and software metrics common to the various Geant4 versions. This preprocessing activity has produced a multi-version dataset covering 34 releases, each sharing the same modules and metrics.

The input unlabelled dataset is turned into a labelled dataset in four steps, three of which are based on the CLAMI approach. In the first step, we have split the unlabelled dataset into a training and a test dataset: the training dataset is composed of 67% of the total instances, while the remaining instances form the test dataset (a minimal sketch of this split is given below). With the aim of labelling the training dataset, we have applied the CLAMI approach.
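As a minimal sketch of this split step, assuming the per-release metrics have been exported to a CSV file (the file name, as well as the use of pandas and scikit-learn, are assumptions of this sketch, not part of our tool chain):

```python
# Sketch of the 67%/33% split of an unlabelled per-release dataset.
# The CSV file name is a placeholder.
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.read_csv("geant4_release_metrics.csv")   # rows: modules, columns: software metrics
train_df, test_df = train_test_split(dataset, train_size=0.67, random_state=42)
```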
We have used an unsupervised algorithm to cluster instances based on the magnitude of their metrics. More in detail, in the clustering phase, we have identified, for each instance, the metrics whose values are greater than a specific cutoff threshold (e.g. the median value) and then counted the number K of metrics whose values exceed the threshold. Afterwards, the instances have been clustered according to their K values. From previous literature, it is known that instances with larger metric values are more likely to be defective [21][22][23]. Therefore, we have labelled the top half of the clusters as defective code and the bottom half as non defective code.
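A simplified sketch of this clustering and labelling step is given below, continuing the split sketch above and assuming the training data is a pandas DataFrame of modules by metrics; the function name and structure are ours, not the reference CLAMI implementation.

```python
# Simplified sketch of the CLAMI clustering and labelling step.
# train_df is assumed to be a pandas DataFrame: rows are modules, columns are metric values.
import pandas as pd

def clami_label(train_df: pd.DataFrame, cutoff: float = 0.5) -> pd.Series:
    """Count, per module, the metrics above the per-metric cutoff quantile and label by cluster."""
    thresholds = train_df.quantile(cutoff)           # per-metric cutoff (0.5 = median)
    k = (train_df > thresholds).sum(axis=1)          # K: number of metrics exceeding the cutoff
    clusters = sorted(k.unique())                    # clusters of modules sharing the same K
    top_half = set(clusters[len(clusters) // 2:])    # clusters with the larger K values
    # Modules in the top-half clusters are labelled as defective, the rest as non defective.
    return k.isin(top_half).map({True: "defective", False: "non-defective"})
```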
The metric selection phase considers the exclusion of the metrics that violate the defect proneness tendency [24]. A violation occurs, for example, when, in a defective-labelled instance, a metric does not exceed its cutoff threshold, or, on the contrary, when, in a non defective-labelled instance, a metric exceeds its cutoff threshold. For each metric, we have computed the metric violation score (MVS), i.e. the ratio between the number of violations and the number of the metric's values. The metrics with the lowest MVS are selected for the training dataset. The last CLAMI phase removes all the instances with any violated metric value. If this operation produces a training dataset without either defective or non defective instances, then another MVS value has to be chosen and the last steps need to be reiterated.

The CLAMI+ approach differs from CLAMI in the Clustering and Labelling steps (2a and 2b in Figure 1). More in detail, CLAMI+ transforms the Boolean representation of metric violations used in CLAMI into a probabilistic value based on the difference between the metric value and the threshold. Consequently, CLAMI+ takes into account how much an instance violates a metric, which leads to a different selection of the final training set that is expected to be more informative than the one built by CLAMI [14].
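Continuing the simplified sketch above, the metric selection and instance selection steps could look as follows; the MVS computation follows the description in the text, while the code structure and names are our own assumptions, not the reference implementation.

```python
# Sketch of the CLAMI metric selection (MVS) and instance selection steps,
# continuing the simplified clami_label sketch above.
import pandas as pd

def select_metrics_and_instances(train_df: pd.DataFrame, labels: pd.Series, cutoff: float = 0.5):
    thresholds = train_df.quantile(cutoff)
    above = train_df > thresholds                    # True where a metric exceeds its cutoff
    defective = labels == "defective"

    # A violation occurs when a defective instance does not exceed the cutoff,
    # or a non defective instance does exceed it.
    violations = above.ne(defective, axis=0)

    # Metric Violation Score: number of violations over the number of metric values.
    mvs = violations.mean(axis=0)
    selected_metrics = mvs[mvs == mvs.min()].index   # keep the metrics with the lowest MVS

    # Instance selection: drop instances violating any of the selected metrics.
    # (If one class disappears, another MVS value is chosen and this step is repeated.)
    keep = ~violations[selected_metrics].any(axis=1)
    return train_df.loc[keep, selected_metrics], labels[keep]
```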
Once the labelled dataset has been obtained, we have applied various ML techniques, selected among those that have already been used in the Software Engineering field and, more specifically, for the software defect prediction problem, as presented in previous literature [25]: AdaBoost (AB) [26], Boosted Logistic Regression (BLR) [21,27], J48 [28], Cost-Sensitive C5.0 (C5.0 Cost) [29], Logistic Model Tree (LMT) [30], Multilayer Perceptron (MLP) [31], Support Vector Machines with Radial Basis Function Kernel (SVM Radial) [32], Partial Least Squares (PLS) [33], Boosted Tree (BT) [34] and Random Forest (RF) [35]. In order to compare the different ML techniques, we have employed the most common performance indicators reported in the literature. For reasons of space, the indicators shown in section 3 are:
• Accuracy, which measures the percentage of instances correctly classified as either defective or non defective;
• Kappa statistic [36], whose value ranges from 0 to 1, which determines how much better a classifier performs than one that simply guesses at random; a value between 0.81 and 0.99 indicates an almost perfect agreement;
• Area Under the ROC (Receiver Operating Characteristic) Curve (AUC) [37], which measures the ability of a classifier to discriminate between the two classes; AUC has lower variance and is more reliable than other performance metrics for software defect prediction [38].
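These indicators can be computed with standard library functions; below is a minimal sketch, assuming labelled training and test sets (X_train, y_train, X_test, y_test) produced by the previous steps, and using Random Forest as one example of the techniques listed above.

```python
# Sketch of training one of the considered classifiers (Random Forest here) and of
# computing Accuracy, Kappa and AUC on the held-out test set.
# X_train, y_train, X_test, y_test are assumed to come from the labelling steps above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score

# Convert the CLAMI string labels to a binary target (1 = defective).
y_train_bin = (y_train == "defective").astype(int)
y_test_bin = (y_test == "defective").astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train_bin)

y_pred = clf.predict(X_test)                          # predicted class per module
y_score = clf.predict_proba(X_test)[:, 1]             # probability of the defective class

print("Accuracy:", accuracy_score(y_test_bin, y_pred))    # fraction correctly classified
print("Kappa:   ", cohen_kappa_score(y_test_bin, y_pred)) # agreement beyond chance
print("AUC:     ", roc_auc_score(y_test_bin, y_score))    # area under the ROC curve
```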
The assessment of our predictions has been conducted by comparing our results against software documentation, such as release notes.

Lessons Learned
In this section, we will discuss the key lessons that we have learned from our experience and also propose activities for future research. It is worth highlighting that our approach uses as input a subset of the Geant4 software dataset that is composed of 482 modules with 66 software metrics and 34 releases.
In the labelling phase, we have determined the defect-prone modules for each release and for different quantiles, according to the CLAMI and CLAMI+ approaches. Lower cutoff values identify software modules that are less defect-prone, whereas modules with larger values in all metrics are more likely to be defective [22]. This study considers just a subset of the modules available for each release; therefore, even though the modules labelled as defective have found a correspondence in the Geant4 documentation, we believe the current results may contain a bias that requires further investigation.
In the metric selection phase, we have removed from 33% to 55% of the total number of metrics, because we chose to keep only the metrics common to the various releases and quantiles (i.e. the cut-off values chosen for the metrics). Therefore, the average number of selected metrics is 38 out of 66 and, more in detail, the resulting set is composed of metrics belonging to the size, complexity, maintainability and object orientation categories.
For the defect prediction, we have applied various classification and regression techniques to the training datasets with 10-fold cross validation and assessed them on the test datasets. We have noticed that Kappa statistic values below 0.81 corresponded to Accuracy values below 90%. We have therefore excluded all the prediction models whose Kappa statistic scored less than 0.81, so as to retain only models with an almost perfect agreement between the observed and the expected accuracy (a minimal sketch of this filtering step is given below). Table 1 and Table 2 show which ML techniques perform best, in terms of Accuracy and AUC, using either the CLAMI or the CLAMI+ approach, over the various quantiles. Quantile numbers from 1 to 9 correspond to the 10%, ..., 90% cutoff values, respectively.
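A minimal sketch of this selection step is given below, again assuming the labelled training set from the previous sketches; only two of the candidate techniques are shown, and the cross-validation utilities are scikit-learn's, not necessarily the implementations cited in the Experimental Setup.

```python
# Sketch of 10-fold cross-validation and Kappa-based model filtering (threshold 0.81).
# X_train and y_train_bin are assumed from the previous sketches.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import cohen_kappa_score

candidates = {
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}

retained = {}
for name, model in candidates.items():
    y_cv = cross_val_predict(model, X_train, y_train_bin, cv=10)   # 10-fold CV predictions
    kappa = cohen_kappa_score(y_train_bin, y_cv)
    if kappa >= 0.81:                        # keep only models with almost perfect agreement
        retained[name] = model.fit(X_train, y_train_bin)

print("Models retained:", list(retained))
```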
With regard to internal validity, each Geant4 release comes with release notes that include information about new developments, changes, bug fixes, performance improvements and so on. According to this documentation, the modules traced for changes and included in the small dataset have also been labelled as defect-prone during the labelling phase. We think that those predictions may contain false positives and false negatives, and thus future work will involve investigating those defect-prone modules and confirming their validity. Geant4 has been developed and maintained for many years: it is important to distinguish what is an improvement from what is a bug fix. Furthermore, the studied dataset may not be representative of the whole of Geant4, and further investigation is needed to confirm our findings.
In terms of external validity, we have considered 34 Geant4 versions, which differ in size, complexity, popularity and revision history. Our small dataset may not be representative of all kinds of WLCG software, and we need to extend our study to other types of source code, such as ROOT.

Conclusion
We have reported our experience, which includes the labelling of a Geant4 multi-release unlabelled software dataset and the application of several Machine Learning techniques. The supervised machine learning algorithms explored take as input a representation of source code in terms of code metrics and predict whether a module is defect-prone or not. We have selected the machine learning techniques that scored more than 0.80 on the Kappa statistic performance metric: J48, AdaBoost, LMT and Bagging. We have found that our predictions of defect-prone modules have a correspondence in the analysis of the software code documentation (e.g. release notes).
This research is at an early stage. The lessons we have learned will lead to several improvements, in order to make our work more accessible to readers in terms of explainability, scalability and usefulness.
We plan to extend our approach to the whole Geant4 dataset and to apply it to other types of WLCG software. In addition, we would like to explore how our approach can be extended to assess different types of changes, such as bug fixes, performance improvements and so on.