Data Mining Techniques for Software Quality Prediction in Open Source Software: An Initial Assessment

Abstract. Software quality monitoring and analysis are among the most productive topics in software engineering research. Their results may be effectively employed by engineers during the software development life cycle. Open source software constitutes a valid test case for the assessment of software characteristics. The data mining approach has been proposed in the literature to extract software characteristics from software engineering data. This paper aims at comparing diverse data mining techniques (e.g., derived from machine learning) for developing effective software quality prediction models. To achieve this goal, we tackled various issues, such as the collection of software metrics from open source repositories, the assessment of prediction models to detect software issues, and the adoption of statistical methods to evaluate data mining techniques. The results of this study aim to identify which of the data mining techniques employed in this paper perform best for software quality prediction models.


Introduction
The software used in scientific environments (e.g., the HEP software) is a rich mixture of in-house software and software taken from the large open source community [1]. Computer scientists are therefore striving to produce and employ high quality software that, at the same time, has been increasing in size and complexity. In order to produce high quality software and save effort, scientists need to know which software modules are defective [2].
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data [3]. Figure 1 shows the transformation from static software engineering data to active data, performed by data mining. As regards software engineering data, there are two important types of data sources: the former is revision control systems (such as CVS, Subversion and Git), which manage the ongoing status of development; the latter is defect tracking software (such as Bugzilla and JIRA) [4]. The aforementioned data constitute the input of one or more data mining techniques (such as Random Forest, Bagging and Support Vector Machine). The output of these techniques helps software engineers to mine patterns and detect violations of patterns, which are likely to be defects. Through data mining, data are converted into knowledge that can help in conducting the most common software engineering tasks: programming, defect detection, testing and maintenance [5]. In the literature, there are many different studies that deal with software quality prediction and data mining techniques. However, to the best of our knowledge, there is no comprehensive study that explains the practical aspects of software analytics models [6]. This study aims at providing an initial comparative performance analysis of different data mining techniques for software quality prediction through a well-documented methodology. Due to the amount of data, this paper provides a subset of results, whose discussion is going to be published in a forthcoming paper.
The remainder of this paper is structured as follows. Section 2 summarizes our research methodology; Section 3 describes the study setup; Section 4 provides some of the collected results with a brief discussion; finally, Section 5 draws our conclusions.

Research Methodology
Our approach is composed of two steps. In the first step, we have conducted a survey of the field of data mining for software engineering issues and, more in detail, of software quality prediction for identifying defect-prone software modules. In the second step, we have attempted to reproduce and expand previous studies comparing data mining techniques, by selecting a subset of software metrics, online datasets, data mining techniques, free data mining tools and packages, and performance criteria.
We collected the most used data mining techniques and metrics by leveraging existing literature.
Support Vector Machine (e.g., SMO): a supervised technique that searches for the optimal hyperplane to separate training data. The hyperplane found is intuitive: it is the one which is maximally distant from the two classes of labelled points located on each side [7,8].
Decision Tree (e.g., J48): a flow-chart-like tree structure. It is composed of: nodes, which represent a test on an attribute value; branches, which show the outcome of the tests; and leaves, which indicate the resulting classes [3].
Naive Bayes: relies on the Bayesian rule of conditional probability. It assumes that all the attributes are independent and analyses each of them individually [9].
Ensemble Classifier (e.g., Random Forest): consists of training multiple classifiers and then combining their predictions [10]. This technique leads to a generalized improvement of the ability of each classifier [11]. According to the way the component classifiers are trained, in parallel or sequentially, we can distinguish two different categories of ensemble. Bagging [12] and Random Forest [13] are both parallel classifiers. Bagging creates multiple versions of the classifier by replicating the learning set in parallel from the original one, and the final decision is made by a majority voting strategy. Random Forest adopts a combination of tree predictors, each depending on the values of a random vector sampled independently and with the same distribution for all trees in the forest. AdaBoost [14] is an example of a sequential classifier, since each classifier of this technique is applied sequentially on the training samples misclassified by the previous one.
Deep Learning: learns a feature hierarchy in which higher level features are formed by the composition of lower level ones. Deep learning techniques leverage learning intermediate representations that can be shared across tasks and, as a consequence, they can exploit unsupervised data and data from similar tasks to improve performance on problems characterised by scarcity of labelled data [15-17].
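As an illustration, the following minimal sketch (ours, not the paper's actual pipeline) instantiates and evaluates most of these classifiers with scikit-learn, one of the tools employed later in this study; synthetic data stands in for a real software-metric dataset.

```python
# Minimal sketch: training and evaluating several of the classifiers
# discussed above with scikit-learn. Synthetic data stands in for a
# software-metric dataset (one row per module, binary defect label).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier)

# Placeholder dataset: 500 "modules" described by 20 "metrics"
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

classifiers = {
    "Support Vector Machine": SVC(),
    "Decision Tree (J48-like)": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "Bagging": BaggingClassifier(n_estimators=100),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
}

for name, clf in classifiers.items():
    # 10-fold cross-validation with accuracy as the performance indicator
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```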
As concerns metrics, we collected all the metrics used in the literature over time; some of them are described below, followed by their compact textbook formulations. McCabe (e.g., Cyclomatic Complexity, Essential Complexity): used to evaluate the complexity of a software program. It is derived from a flow graph and is mathematically computed using graph theory. Basically, it is determined by counting the number of decision statements in a program [18,19].
Halstead (e.g., Base Measures, Derived Measures): used to measure some characteristics of a program module, such as the "Length", the "Potential Volume", the "Difficulty" and the "Programming Time", by employing some basic metrics like the number of unique operators, the number of unique operands, the total occurrences of operators and the total occurrences of operands [20,21].
Size (e.g., Lines of Code, Comment Lines of Code): the Lines of Code (LOC) metric is used to measure a software module, and the accumulated LOC of all the modules is used to measure a program [22].
Chidamber and Kemerer (e.g., Number of Children, Depth of Inheritance): used for object-oriented programs; it is the most popular suite for performing software analysis and prediction, and it has been adopted by many software tool vendors and computer scientists [23,24]. Some metrics of the suite are: Weighted Methods Per Class, which counts the methods in each class; Depth of Inheritance Tree, which measures the length of the longest path from a class to the root of the inheritance tree; Number Of Children, which counts the classes that are direct descendants of each class.
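For reference, the core quantities behind the McCabe and Halstead suites admit compact formulations (standard textbook definitions, added here by us, not taken from the paper's datasets). With E edges, N nodes and P connected components of the control-flow graph, the cyclomatic complexity is

\[ V(G) = E - N + 2P \]

which, for a single-entry, single-exit module, reduces to the number of decision points plus one. With n_1, n_2 the numbers of unique operators and operands and N_1, N_2 their total occurrences, the main Halstead derived measures are

\[ n = n_1 + n_2, \qquad N = N_1 + N_2, \qquad V = N \log_2 n, \qquad D = \frac{n_1}{2} \cdot \frac{N_2}{n_2} \]

where n is the vocabulary, N the length, V the volume and D the difficulty of the module.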

Study Setup
Unlike previous literature, we also consider Deep Learning techniques [15,25], which have gained importance in recent years. In the past, their employment has mainly been in Computer Vision, Natural Language Processing and Speech Recognition [26].
In the past, authors have focused their attention mainly on the NASA Defect Dataset [27-32], which can be found in online repositories [33,34]. On the other hand, we have decided to widen our scope by including some datasets related to open source projects such as Eclipse [35], Android and Elastic Search [36]. Table 1 shows a summary of the most important characteristics of these datasets in terms of number of projects, metrics, modules and percentage of defective modules per project, reporting their range whenever possible. We have collected the performance criteria used by previous literature. All the definitions below (see Eqs. 1, 2, 3, 4) are based on the confusion matrix shown in Table 2. Accuracy (see Eq. 1) is the percentage of modules correctly classified as either faulty or non-faulty. Precision (see Eq. 2) is the percentage of modules classified as faulty that are actually faulty. Recall (see Eq. 3), or Completeness, is the percentage of faulty modules that are predicted as faulty. Mean Absolute Error measures how much the predicted and actual fault rates differ. F-measure (see Eq. 4) is a combined measure of recall and precision: the higher the value of this indicator, the better the quality of the learning method for software prediction.

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1) \]

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2) \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3) \]

\[ \mathrm{F\text{-}measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4) \]

In this paper, we have provided results only for the accuracy performance indicator. A more extensive analysis is going to be included in a forthcoming paper.
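The following minimal sketch (ours) shows how these indicators can be obtained from a confusion matrix with scikit-learn; the labels are illustrative placeholders, not measurements from this study.

```python
# Minimal sketch: Eqs. 1-4 computed from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # actual labels (1 = faulty module)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # labels predicted by a classifier

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)                # Eq. 1
precision = tp / (tp + fp)                                 # Eq. 2
recall    = tp / (tp + fn)                                 # Eq. 3
f_measure = 2 * precision * recall / (precision + recall)  # Eq. 4
print(accuracy, precision, recall, f_measure)
```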
Figure 2 describes our solution workflow. We have selected some online datasets (i.e., NASA, Eclipse, Android and Elastic Search), including all the metrics contained in those repositories. We have performed some cleaning operations when needed, replacing missing values with the mean of the other values related to the same metric [27]. The obtained data have been used as input to many data mining techniques (both supervised and unsupervised) by employing three different free open source tools: Weka [37], scikit-learn [38] and R [39]. We have collected the output of all the executions of the algorithms and we have compared their values according to the performance indicators.
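As an illustration of the cleaning step, the sketch below (ours, assuming scikit-learn's standard mean imputer matches the mean-replacement described above) fills each missing value with the per-metric mean:

```python
# Minimal sketch: each missing value is replaced with the mean of the
# remaining values of the same metric (i.e., the same column).
import numpy as np
from sklearn.impute import SimpleImputer

# Three modules described by two metrics, with two missing entries
X = np.array([[1.0, 10.0],
              [np.nan, 12.0],
              [3.0, np.nan]])

imputer = SimpleImputer(strategy="mean")   # column-wise (per-metric) mean
X_clean = imputer.fit_transform(X)
print(X_clean)
# [[ 1. 10.]
#  [ 2. 12.]
#  [ 3. 11.]]
```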

Initial Assessment
The histograms in Figures 3 and 4 show the average accuracy of the considered data mining techniques on each dataset. Each value has been obtained by computing the average accuracy over the corresponding dataset. The compared data mining techniques are: Naive Bayes, Multi Layer Perceptron, Support Vector Machine, AdaBoost, Bagging, Random Forest, J48, K-Nearest Neighbor, RBF Neural Network, DeepLearning4j.
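For clarity, the aggregation behind each histogram bar can be sketched as follows (the technique names are from the list above, while project names and accuracy values are illustrative, not the study's measurements):

```python
# Minimal sketch: average accuracy per technique over a dataset's projects.
import pandas as pd

runs = pd.DataFrame([
    {"technique": "Random Forest", "project": "project-a", "accuracy": 0.88},
    {"technique": "Random Forest", "project": "project-b", "accuracy": 0.82},
    {"technique": "Naive Bayes",   "project": "project-a", "accuracy": 0.79},
    {"technique": "Naive Bayes",   "project": "project-b", "accuracy": 0.75},
])

# One histogram bar per technique: the mean accuracy across projects
print(runs.groupby("technique")["accuracy"].mean())
```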

Conclusion
In this study, we have shown an initial comparison of data mining techniques in the context of software defect prediction. To achieve this goal, we have leveraged existing literature to select software metrics, datasets, data mining techniques and performance criteria. By analysing the results, we can conclude that Bagging and Random Forest have achieved the best average accuracy over all the datasets. We have also shown that data mining can constitute a valid helping hand in determining and predicting software quality, and can be used together with statistical analysis.
Currently, we are experimenting with the same techniques on software used in HEP.
This research was supported by INFN CNAF.

Appendix A Glossary
Software Quality, according to IEEE, is the degree to which a system meets specified requirements or customer or user needs or expectations. According to ISO, quality is the degree to which a set of inherent characteristics fulfils requirements. Data Mining is the process of discovering interesting patterns and knowledge from large amounts of data contained in datasets.
Defect is an imperfection or deficiency in a work product where that work product does not meet its requirements or specifications and needs to be repaired or replaced.

Figure 3. Average Accuracy for the NASA and Eclipse datasets

Figure 4. Average Accuracy for the Android and ElasticSearch datasets

Table 1. Summary of the datasets employed