Too trivial to test? An inverse view on defect prediction to identify methods with low fault risk

Background. Test resources are usually limited, and therefore it is often not possible to completely test an application before a release. To cope with the problem of scarce resources, development teams can apply defect prediction to identify fault-prone code regions. However, defect prediction tends to achieve low precision in cross-project prediction scenarios. Aims. We take an inverse view on defect prediction and aim to identify methods that can be deferred when testing because they contain hardly any faults due to their code being “trivial”. We expect that characteristics of such methods might be project-independent, so that our approach could improve cross-project predictions. Method. We compute code metrics and apply association rule mining to create rules for identifying methods with low fault risk (LFR). We conduct an empirical study to assess our approach with six Java open-source projects containing precise fault data at the method level. Results. Our results show that inverse defect prediction can identify approx. 32–44% of the methods of a project as having an LFR; on average, they are about six times less likely to contain a fault than other methods. In cross-project predictions with larger, more diversified training sets, identified methods are even 11 times less likely to contain a fault. Conclusions. Inverse defect prediction supports the efficient allocation of test resources by identifying methods that can be treated with less priority in testing activities and is well applicable in cross-project prediction scenarios.


INTRODUCTION
In a perfect world, it would be possible to completely test every new version of a software application before it was deployed into production. In practice, however, software development teams often face a problem of scarce test resources. Developers are busy implementing features and bug fixes, and may lack time to develop enough automated unit tests to comprehensively test new code [Ostrand et al. (2005); Menzies and Di Stefano (2004)]. Furthermore, testing is costly and, depending on the criticality of a system, it may not be cost-effective to expend equal test effort on all components [Zhang et al. (2007)]. Hence, development teams need to prioritize and limit their testing scope by restricting the code regions to be tested [Menzies et al. (2003); Bertolino (2007)]. To cope with the problem of scarce test resources, development teams aim to test code regions that have the best cost-benefit ratio regarding fault detection. To support development teams in this activity, defect prediction has been developed and studied extensively; it identifies code regions that are likely to contain a fault and should therefore be tested [Menzies et al. (2007); Weyuker and Ostrand (2008)].

This paper suggests, implements, and evaluates another view on defect prediction: inverse defect prediction (IDP). The idea behind IDP is to identify code artifacts (e.g., methods) that are so trivial that they contain hardly any faults and thus can be deferred or ignored in testing. Like traditional defect prediction, IDP also uses a set of metrics that characterize artifacts, applies transformations to pre-process the metrics, and uses a machine-learning classifier to build a prediction model. The difference rather lies in the predicted classes. While defect prediction classifies an artifact either as buggy or non-buggy, IDP identifies methods that exhibit a low fault risk (LFR) with high certainty and does not make an assumption about the remaining methods, for which the fault risk is at least medium or cannot be reliably determined. As a consequence, the objective of the prediction also differs. Defect prediction aims to achieve a high recall, such that as many faults as possible can be detected, and a high precision, such that only few false positives occur. In contrast, IDP aims to achieve high precision to ensure that low-fault-risk methods indeed contain hardly any faults, but it does not necessarily seek to predict all non-faulty methods. Still, IDP needs to achieve a certain recall such that a reasonable reduction potential arises when treating LFR methods with a lower priority in QA activities.

Research goal: We want to study whether IDP can reliably identify code regions that exhibit only a low fault risk, whether ignoring such code regions (as is done silently in defect prediction) is a good idea, and whether IDP can be used in cross-project predictions.

To implement IDP, we calculated code metrics for each method of a code base and trained a classifier for methods with low fault risk using association rule mining. To evaluate IDP, we performed an empirical study with the Defects4J dataset [Just et al. (2014)], which consists of real faults from six open-source projects. We applied static code analysis and classifier learning on these code bases and evaluated the results. We hypothesize that IDP can be used to pragmatically address the problem of scarce test resources.
More specifically, we hypothesize that a generalized IDP model can be used to identify code regions that can be deferred when writing automated tests if none exist yet, as is the situation for many legacy code bases.

Contributions: 1) The idea of an inverse view on defect prediction: while defect prediction has been studied extensively in the last decades, it has always been employed to identify code regions with a high fault risk. To the best of our knowledge, the present paper is the first to study the identification of methods with a low fault risk. 2) A dataset containing the computed metrics for the methods in the code bases and an indication whether they were changed in a bug-fix patch, a list of methods that changed in bug fixes only to preserve API compatibility, and association rules to identify low-fault-risk methods.

The remainder of this paper is organized as follows. Section 2 provides background information about association rule mining. Section 3 discusses related work. Section 4 describes the IDP approach, i.e., the computation of the metrics for each method, the data pre-processing, and the association rule mining to identify methods with low fault risk. Afterwards, Section 5 summarizes the design and results of the IDP study with the Defects4J dataset. Then, Section 6 discusses the study's results, implications, and threats to validity. Finally, Section 7 summarizes the main findings and sketches future work.

RELATED WORK

Our work differs from the above-mentioned work in the target setting: we do not predict artifacts that are fault-prone, but instead identify artifacts (methods) that are very unlikely to contain any faults. While defect prediction aims to detect as many faults as possible (without too many false positives), and thus strives for a high recall [Mende and Koschke (2009)], our IDP approach strives to identify those methods that are not fault-prone with high certainty. Therefore, we optimized our approach towards precision in detecting low-fault-risk methods and considered recall as less important. To the best of our knowledge, this is the first work to study low-fault-risk methods. Moreover, as far as we know, cross-project prediction has not yet been applied at the method level. To perform the classification, we applied association rule mining.

APPROACH

This section describes the inverse defect prediction approach, which identifies low-fault-risk (LFR) methods. The approach comprises the computation of source-code metrics for each method, the data pre-processing before the mining, and the association rule mining. Figure 1 illustrates these steps.

Like defect prediction models, IDP uses metrics to train a classifier for identifying low-fault-risk methods.

For each method, we compute the source-code metrics listed in Table 1.

Next, we derive further metrics from the existing ones. They are redundant, but correlated metrics do not have any negative effects on association rule mining (except on the computation time) and may improve the results for the following reason: if an item generated from a metric is not frequent, rules with this item will be discarded because they cannot achieve the minimum support; however, an item for a more general metric may be more frequent and survive. The derived metrics are: […]

Furthermore, we compute to which of the following categories a method belongs (a method can belong to zero, one, or more categories; a detection sketch follows the list):

• Constructors: Special methods that create and initialize an instance of a class. They might be less fault-prone because they often only set class variables or delegate to another constructor.

• Getters: Methods that return a class variable. They usually consist of a single statement and can be generated by the IDE.

• Setters: Methods that set the value of a class variable. They usually consist of a single statement and can be generated by the IDE.

• Empty Methods: Non-abstract methods without any statements. They often exist to meet an implemented interface, or because the default logic is to do nothing and is supposed to be overridden in certain sub-classes.

• Delegation Methods: Methods that delegate the call to another method with the same name and further parameters. They often do not contain any logic besides the delegation.

• ToString Methods: Implementations of Java's toString method. They are often used only for debugging purposes and can be generated by the IDE.
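To make these categories concrete, the following sketch shows how they could be detected on a simplified method model. The Method type and its statement descriptors are illustrative assumptions of this example; the study derives this information from parsed Java code, not from such a model.

    from dataclasses import dataclass

    @dataclass
    class Method:
        # Illustrative stand-in for a parsed Java method; it only carries
        # what the category checks below need.
        name: str
        is_constructor: bool
        is_abstract: bool
        param_count: int
        statements: list  # e.g. [("return_field", "x")] or [("call", "log")]

    def is_empty(m: Method) -> bool:
        # Empty method: non-abstract, but without any statements.
        return not m.is_abstract and not m.statements

    def is_getter(m: Method) -> bool:
        # Getter: a single statement returning a class variable.
        return len(m.statements) == 1 and m.statements[0][0] == "return_field"

    def is_setter(m: Method) -> bool:
        # Setter: a single statement assigning a value to a class variable.
        return len(m.statements) == 1 and m.statements[0][0] == "assign_field"

    def is_delegation(m: Method) -> bool:
        # Delegation: the only statement calls a method of the same name.
        s = m.statements
        return len(s) == 1 and s[0][0] == "call" and s[0][1] == m.name

    def is_tostring(m: Method) -> bool:
        return m.name == "toString" and m.param_count == 0

    def categories(m: Method) -> set:
        # A method can belong to zero, one, or more categories.
        flags = {"Constructor": m.is_constructor, "Empty": is_empty(m),
                 "Getter": is_getter(m), "Setter": is_setter(m),
                 "Delegation": is_delegation(m), "ToString": is_tostring(m)}
        return {name for name, holds in flags.items() if holds}

Since a method can match several checks, the function returns a set of categories rather than a single label.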

Note that we only use source-code metrics and do not consider process metrics. This is because we want to identify methods that exhibit a low fault risk due to their code.

Association rule mining computes frequent itemsets from categorical attributes; therefore, our next step is to discretize the numerical metrics. (In defect prediction, discretization is also applied to the metrics.) A sketch of the resulting item generation follows after the list below.

• For all count metrics (including the derived ones), we create a binary "has-no" metric, which is true if the value is zero, e.g., CountLoops = 0 ⇒ NoLoops = true.

• For the method categories (setter, getter, . . . ), no transformation is necessary as they are already binary.

Prior to applying the mining algorithm, we have 1) to address faulty methods with multiple occurrences, 2) to create a unified list of faulty and non-faulty methods, and 3) to tackle dataset imbalance. In imbalanced datasets, many non-expressive rules will be generated when most methods are not faulty. For example, if 95% of the methods are not faulty and 90% of them contain a method invocation, rules with high support will be generated that use this association to identify non-faulty methods. Balancing avoids those nonsense rules.

To determine n, the number of top rules used by a classifier, we compute the maximum number of rules until the faulty methods among the low-fault-risk methods exceed a certain threshold in the training set.
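As an illustration of the item generation described above, the following sketch derives the binary items for one method. The class boundaries are made up for the example; the study's actual ranges are the ones listed in Table 3.

    def to_items(count_metrics, categories):
        # Turn one method's numeric count metrics and binary categories
        # into items for association rule mining.
        items = set()
        for name, value in count_metrics.items():
            if value == 0:
                # Binary "has-no" item, e.g., CountLoops = 0 => NoLoops = true.
                items.add("No" + name[len("Count"):])
            # Discretized class item; identical boundaries for all projects.
            label = "low" if value <= 1 else "medium" if value <= 3 else "high"
            items.add(name + "=" + label)
        # Method categories (setter, getter, ...) are already binary.
        items.update("Is" + c for c in categories)
        return items

    # Example: a getter consisting of a single statement and no loops.
    print(to_items({"CountLoops": 0, "CountStatements": 1}, {"Getter"}))
    # e.g. {'NoLoops', 'CountLoops=low', 'CountStatements=low', 'IsGetter'}
    # (set order varies)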

Of course, IDP can also be used with other machine-learning algorithms. We decided to use association rule mining because of the natural comprehensibility of the rules (see Section 2) and because we achieved a better performance compared to models we trained using Random Forest.
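To make the mining step concrete, the following self-contained brute-force sketch enumerates candidate rules with NotFaulty as consequent and sorts them by confidence. It is an illustration only (the study uses an apriori implementation in R); the item names, thresholds, and the exhaustive enumeration are assumptions of the example.

    from itertools import combinations

    def mine_lfr_rules(transactions, min_support=0.2, min_confidence=0.95, max_len=3):
        # transactions: one item set per method; a method's set contains
        # "NotFaulty" if the method was never changed in a bug fix.
        n = len(transactions)
        items = sorted(set().union(*transactions) - {"NotFaulty"})
        rules = []
        for size in range(1, max_len + 1):
            for antecedent in combinations(items, size):
                a = set(antecedent)
                covered = [t for t in transactions if a <= t]
                if not covered:
                    continue
                both = sum(1 for t in covered if "NotFaulty" in t)
                support = both / n
                confidence = both / len(covered)
                # Keep rules "antecedent => NotFaulty" passing both thresholds:
                # minimum support discards overly specific rules, minimum
                # confidence discards imprecise ones.
                if support >= min_support and confidence >= min_confidence:
                    rules.append((antecedent, support, confidence))
        # Sort descending by confidence, as done for the top-n selection.
        return sorted(rules, key=lambda r: -r[2])

    methods = [
        {"NoLoops", "IsGetter", "NotFaulty"},
        {"NoLoops", "NotFaulty"},
        {"NoLoops"},
        set(),
    ]
    for antecedent, support, confidence in mine_lfr_rules(methods, 0.25, 0.6):
        print(antecedent, "support=%.2f" % support, "confidence=%.2f" % confidence)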

STUDY

This section reports on the empirical study that we conducted to evaluate the inverse defect prediction approach.

We investigate the following research questions to study how well methods that contain hardly any faults can be identified and whether IDP is applicable in cross-project scenarios.

RQ 1: How many faults do methods classified as "low fault risk" contain? To evaluate the precision of the classifier, we investigate how many methods that are classified as "low fault risk" (due to the triviality of their code) are faulty. If we want to use the low-fault-risk classifier for determining methods that require less focus during quality assurance (QA) activities, such as testing and code reviews, we need to be sure that these methods contain hardly any faults.

RQ 2: How large is the fraction of the code base consisting of methods classified as "low fault risk"? We study how common low-fault-risk methods are in code bases to find out how much code is of lower importance for quality-assurance activities. We want to determine what savings potential can arise if these methods are excluded from QA.

RQ 3: Is a trained classifier for methods with low fault risk generalizable to other projects? We study whether IDP can be applied in cross-project prediction scenarios, that is, whether a classifier trained on other projects can identify LFR methods.

Defects4J provides for each project a set of reverse patches, which represent bug fixes. To obtain the list of methods that were at least once faulty, we conducted the following steps for each patch. First, we checked out the source code from the project repository at the original bug-fix commit and stored it as the fixed version. Second, we applied the reverse patch to the fixed version to get to the code before the bug fix and stored the result as the faulty version.

Next, we analyzed the two versions created for every patch. For each file that was changed between the faulty and the fixed version, we parsed the source code (using the Eclipse JDT, http://www.eclipse.org/jdt/) to identify the methods. We then mapped the code changes to the methods to determine which methods were touched in the bug fix. After that, we had the list of faulty methods. Figure 2 summarizes these steps; a code sketch follows below.

We inspected all 395 bug-fix patches and found that 10 method changes in the patches do not represent bug fixes. While the patches are minimal, such that they contain only bug-related changes (see Section 5.2), these ten method changes are semantics-preserving, only necessary because of changed signatures of other methods in the patch, and therefore included in Defects4J to keep the code compilable. Figure 3 […]

Table 3 presents the value ranges of the resulting classes. The classes are the same for all six projects. (We did not use the ntile function to create the classes because it always generates classes of the same size, such that instances with the same value may end up in different classes; e.g., if 50% of the methods have the complexity value 1, the first 33.3% would end up in class 1, and the remaining 16.7% with the same value in class 2.) We then aggregated multiple faulty occurrences of the same method (this occurs if a method is changed in more than one bug-fix patch) and created a unified dataset of faulty and non-faulty methods.
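The following sketch mirrors the two checkout steps described above. Defects4J ships its own tooling for this, so the use of git and GNU patch, as well as all paths and arguments, are assumptions of this example.

    import shutil
    import subprocess
    from pathlib import Path

    def create_versions(repo: Path, bugfix_commit: str,
                        reverse_patch: Path, out: Path):
        # 1) Check out the project at the original bug-fix commit
        #    and store it as the fixed version.
        subprocess.run(["git", "-C", str(repo), "checkout", bugfix_commit],
                       check=True)
        fixed = out / "fixed"
        shutil.copytree(repo, fixed, dirs_exist_ok=True)

        # 2) Apply the reverse patch to the fixed version to obtain the
        #    faulty version (the code as it was before the bug fix).
        faulty = out / "faulty"
        shutil.copytree(fixed, faulty, dirs_exist_ok=True)
        subprocess.run(["patch", "-p1", "-d", str(faulty),
                        "-i", str(reverse_patch)], check=True)

        # 3) Parsing both versions and mapping changed lines to methods
        #    (to obtain the list of faulty methods) is done separately.
        return fixed, faulty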

Next, we split the dataset into a training and a test set. For RQ 1 and RQ 2, we used 10-fold cross-validation: the dataset is split into ten partitions, and each partition is used once for testing the classifier, which is trained on the remaining nine partitions.

We then computed the association rules on the training set. The minimum support prevents overly specific (i.e., non-generalizable) rules from being created, and the minimum confidence prevents the creation of imprecise rules. Note that no rule (with NotFaulty as rule consequent) can reach a higher support than 50% after the SMOTE pre-processing. After computing the rules, we removed redundant ones using the corresponding function of the R package arules (https://cran.r-project.org/). Finally, we sorted the remaining rules in descending order of their confidence.

Using these rules, we created two classifiers to identify low-fault-risk (LFR) methods. They differ in the threshold (2.5% resp. 5%) on the share of faulty methods tolerated among the LFR methods in the training set, which determines the number n of top rules used.

To answer RQ 1, we used 10-fold cross-validation to evaluate the classifiers separately for each project. We computed the number and proportion of methods that were classified as "low fault risk" but contained a fault (≈ false positives). For the sake of completeness, we also computed precision and recall (illustrated in the sketch below), although we believe that the recall is of lesser importance for our purpose, because we do not seek to predict all non-faulty methods. (We computed the results once with and once without addressing the data imbalance in the training set; the prediction performance was better when applying SMOTE, so we decided to use it.)

To answer RQ 3, we computed the association rules for each project with the methods of the other five projects as training data. As in RQ 1 and RQ 2, we determined the number of used top n rules with the same thresholds (2.5% and 5%). To allow a comparison with the within-project classifiers, we computed the same metrics as in RQ 1 and RQ 2.
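The following minimal helper illustrates how these measures relate for IDP. It is an illustrative reconstruction, not the study's evaluation code: precision is the share of LFR-classified methods that are indeed non-faulty, and recall relates the correctly identified LFR methods to all non-faulty methods.

    def evaluate_lfr(is_lfr, is_faulty):
        # is_lfr[i]: the classifier marks method i as low fault risk.
        # is_faulty[i]: method i was changed in at least one bug-fix patch.
        lfr_total = sum(is_lfr)
        faulty_lfr = sum(1 for p, f in zip(is_lfr, is_faulty) if p and f)

        # Precision: share of LFR-classified methods that are non-faulty.
        precision = (lfr_total - faulty_lfr) / lfr_total
        # Recall: share of all non-faulty methods identified as LFR.
        non_faulty = sum(1 for f in is_faulty if not f)
        recall = (lfr_total - faulty_lfr) / non_faulty
        return faulty_lfr, precision, recall

    # Toy example: 3 of 4 methods classified as LFR, one of them faulty.
    print(evaluate_lfr([True, True, True, False], [False, True, False, False]))
    # -> (1, 0.666..., 0.666...)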

This section presents the results for the research questions. The data to reproduce the results is available […].

On average, the proportion of faulty methods among the methods classified as "low fault risk" is 0.3% resp. 0.4% for the two classifiers.

The fault-density reduction factor for the stricter classifier ranges between 4.3 and 10.9 (median: 5.7) when considering methods, and between 1.5 and 4.4 (median: 3.2) when considering SLOC. In the project Lang, 28.6% of all methods, with 13.8% of the SLOC, are classified as LFR and contain 4.1% of all faults; thus, the factor is 7.0 (SLOC-based: 3.4). The factor never falls below 1 for either classifier.
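One way to reconstruct the factor from the reported shares is to relate the proportion of methods (or SLOC) classified as LFR to the proportion of faults contained in them. For Lang:

    factor(methods) = 28.6% / 4.1% ≈ 7.0
    factor(SLOC)    = 13.8% / 4.1% ≈ 3.4

In other words, the fault density of the identified LFR methods in Lang is about one seventh of the fault density of an average method.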

IDP can identify methods with low fault risk. On average, only 0.3% of the methods classified as "low fault risk" by the strict classifier are faulty. The identified LFR methods are, on average, 5.7 times less likely to contain a fault than an arbitrary method in the dataset. Using within-project IDP, on average, 32–44% of the methods, comprising about 15–20% of the SLOC, can be assigned a lower importance during testing.

In the best case, when ignoring 16.5% of the methods (4.8% of the SLOC), it is still possible to catch 98.5% of the faults (Math).

RQ 3: Is a trained classifier for methods with low fault risk generalizable to other projects? Table 6 presents the results for the cross-project prediction with training data from the respective other five projects.

[…] operations, which are often wrapped in loops or conditions; most of the faults are located in these methods.

Therefore, the within-project classifiers used few, very precise rules for the identification of LFR methods.

To sum up, our results show that the IDP approach can be used to identify methods that are, due to the "triviality" of their code, less likely to contain any faults. Hence, these methods require less focus during quality-assurance activities. Depending on the criticality of the system and the risk one is willing to take, the development of tests for these methods can be deferred or even omitted if the available test resources are insufficient. The results suggest that IDP is also applicable in cross-project prediction scenarios, indicating that characteristics of low-fault-risk methods differ less between projects than characteristics of faulty methods do. Therefore, IDP can be used in (new) projects with no (precise) historical fault data to prioritize the code to be tested.

Next, we discuss the threats to internal and external validity.

Therefore, the quality of our data depends on the quality of Defects4J.

[…] To generalize our approach to other languages, the collected metrics and the low-fault-risk classifier need to be validated and adjusted.

Other languages may use language constructs in a different way or use constructs that do not exist in Java. For example, a classifier for the C language should take constructs such as GOTOs and the use of […] into account.

CONCLUSION

Development teams often face the problem of scarce test resources and therefore need to prioritize their testing efforts (e.g., when writing new automated unit tests). Defect prediction can support developers in this activity. In this paper, we propose an inverse view on defect prediction (IDP) to identify methods that are so "trivial" that they contain hardly any faults. We study how reliably such low-fault-risk methods can be identified, how common they are, and whether the proposed approach is applicable for cross-project predictions.

We show that IDP using association rule mining on code metrics can successfully identify low-fault-risk methods. The identified methods contain considerably fewer faults than the average code and can provide a savings potential for QA activities. Depending on the parameters, a lower priority for QA can be assigned on average to 31.7% resp. 44.1% of the methods, amounting to 14.8% resp. 20.0% of the SLOC.

For future work, we want to replicate this study with closed-source projects, projects of other application types, and projects in other programming languages. It is also of interest to investigate which metrics and classifiers are most effective for the IDP purpose and whether they differ from the ones used in traditional defect prediction. Moreover, we plan to study whether code coverage of low-fault-risk methods differs from code coverage of other methods. If guidelines to meet a certain code-coverage level are set by management, unmotivated testers may add tests for low-fault-risk methods first, because it might be easier to write tests for those methods. Consequently, more complex methods with a higher fault risk may remain untested once the target coverage is achieved. Therefore, we want to investigate whether this is a problem in industry and whether it can be addressed with an adjusted code-coverage computation that takes low-fault-risk methods into account.