The effects of change decomposition on code review—a controlled experiment

Background: Code review is a cognitively demanding and time-consuming process. Previous qualitative studies hinted that decomposing change sets into multiple, internally coherent ones would improve the reviewing process. So far, the literature has provided no quantitative analysis of this hypothesis.
Aims: (1) Quantitatively measure the effects of change decomposition on the outcome of code review (in terms of number of found defects, wrongly reported issues, suggested improvements, time, and understanding); (2) qualitatively analyze how subjects approach the review and navigate the code, building knowledge and addressing existing issues, in large vs. decomposed changes.
Method: Controlled experiment using the pull-based development model, involving 28 software developers among professionals and graduate students.
Results: Change decomposition leads to fewer wrongly reported issues and influences how subjects approach and conduct the review activity (by increasing context-seeking), yet impacts neither the understanding of the change rationale nor the number of found defects.
Conclusions: Change decomposition not only reduces the noise for subsequent data analyses but also significantly supports the tasks of the developers in charge of reviewing the changes. As such, commits belonging to different concepts should be separated, adopting this as a best practice in software engineering.


INTRODUCTION
Code review is the activity performed by software teams to check code quality, with the purpose of identifying issues and shortcomings (Bacchelli and Bird, 2013). Nowadays, reviews are mostly performed in an iterative, informal, change- and tool-based fashion, also known as Modern Code Review (MCR) (Cohen, 2010). Both open-source and industrial software teams employ MCR to check code changes before they are integrated into their codebases (Rigby and Bird, 2013). Past research has provided evidence that MCR is associated with improvements in key software quality aspects, such as maintainability (Morales et al., 2015). Practitioners have reported that tangled changes, i.e., changes addressing multiple concerns at once, hamper the review process, and have asked for tools to automatically decompose them (Tao et al., 2012; Barnett et al., 2015). Accordingly, change untangling mechanisms have been proposed (Tao and Kim, 2015).

In this paper, we continue on this research line and focus on evaluating the effects of change decomposition on code review. We aim at answering questions such as: Is change decomposition beneficial for understanding the rationale of the change? Does it have an impact on the number/types of issues raised? Are there differences in time to review? Are there variations with respect to defect lifetime?

To this end, we designed a controlled experiment focusing on pull requests, a widespread approach to submit and review changes. Our work investigates whether the results from Tao and Kim (2015) can be replicated, and extends the knowledge on the topic. With a Java system as a subject, we asked 28 software developers among professionals and graduate students to review a refactoring and a new feature (according to professional developers (Tao et al., 2012), these are the most difficult to review when tangled). We measure how partitioning vs. non-partitioning of the changes impacts defects found, false positive issues, suggested improvements, time to review, and understanding of the change rationale. We also perform qualitative observations on how subjects conduct the review and address defects or raise false positives in the two scenarios.

This paper makes the following contributions:
• the design of an asynchronous controlled experiment to assess the benefits of change decomposition in code review using pull requests, available for replication (di Biase et al., 2018);
• empirical evidence that change decomposition in the pull-based review environment leads to fewer false positives.

The paper proceeds as follows: Section 2 illustrates the related work; Section 3 describes our research objectives; the design of our experiment is described in Section 4; threats to validity are discussed in Section 5; results are presented in Section 6; Section 7 reports the discussion based on the results; finally, Section 8 summarizes our study.

RELATED WORK
Several studies explored tangled changes and concern separation in code reviews. Tao et al. (2012) investigated the role of understanding code changes during the software development process, exploring practitioners' needs. Their study outlined that grasping the rationale of a change during code review is indispensable. Moreover, to understand a composite change, it is useful to decompose it into its constituent parts. Kirinuki et al. (2014) dealt with the presence of multiple concepts in a single code change. They showed that such changes are unsuitable for merging code from different branches, and that tangled changes are difficult to review because practitioners have to seek the changes relevant to the specified task in the commit.

Regarding empirical controlled experiments on the topic of code review, the most relevant work is by Uwano et al. (2006). They used an eye-tracker to characterize the performance of subjects reviewing source code. Their experimentation environment enabled them to identify a pattern called scan, consisting of the reviewer reading the entire code before investigating the details of each line. In addition, their qualitative analysis found that participants who did not spend enough time during the scan took more time to find defects. Uwano's experiment was replicated by Sharif et al. (2012), whose results indicated that the longer participants spent in the scan, the quicker they were able to find the defect.

Given the structure and the settings of our experimentation, we can also measure the time spent on the review activity and defect lifetime.

EXPERIMENTAL DESIGN
In this section, we detail how we designed the experiment and the research method that we followed.

Object system chosen for the experiment
The system used for the reviews in the experiment is JPacman, an open-source Java system available on GitHub that emulates a popular arcade game and is used at Delft University of Technology to teach software testing.

The system has about 3,000 lines of code and was selected because a more complex and larger project would have required participants to grasp the rationale of a more elaborate system. In addition, the training phase required for the experiment would have implied hours of effort, increasing the fatigue that participants might experience. Ultimately, the experiment targets assessing differences due to review partitioning and is tailored to evaluate a process rather than a product.

Recruiting of the subject participants
The study was conducted with 28 participants, recruited by means of convenience sampling, to have a larger sample for our study and increase its external validity. Using a questionnaire, we asked participants about their development experience, their language-specific skills, and their review experience as the number of reviews performed per week. We also included a question asking whether a participant already knew the source code of the game. Participants were provided with instructions on how to use the virtual machine, but no time window was set.

The independent variable of our study is change decomposition in pull requests. We split our subjects into two groups: a control group and a treatment group.

This value is closer to the minimum than to the median reported for similar experiments (Ko et al., 2015). As stated before, though, we did not suggest or force any strict limit on the duration of the experiment, so as to replicate an informal code review scenario. No learning effect is present, as every participant ran the experiment only once.

We ran two pilot experiments to assess the settings. The first pilot subject (a developer with 5 years of full-time equivalent (FTE) experience) took too long to complete the training and encountered some issues with the virtual machine. Consequently, we restructured the training phase, addressing the potential environment issues in the material.

2 - Training the participants. Before starting with the review phase, we first ensured that the participants were sufficiently familiar with the system. It is likely that the participants had never seen the codebase before, a situation that would limit the realism of the subsequent review task.

To train our participants, we asked subjects to implement three different features in the system. This learning-by-doing approach is expected to be more effective than simply providing training material to participants (Slavin, 1987). By definition, this approach is a method of instruction where the focus is on the role of feedback in learning. The desired features required changes across the system's codebase.

The third feature to be implemented targeted the classes and components of the game that would be the object of the review tasks. We chose to place this feature last in order to progressively increase the level of difficulty.

No time window was given to participants, aiming for a more realistic scenario. As explicitly mentioned in the provided instructions, participants were allowed to use any source to retrieve information about something they did not know. This was permitted because the study does not aim to assess skills in implementing functionality in a programming language. The only limitation was that participants had to use the tools within the virtual machine.

Answers were given on a Likert scale from "Strongly disagree" (1) to "Strongly agree" (5) (Oppenheim, 2000), reported as mean, median, and standard deviation over the two groups, and tested for statistical significance with the Mann-Whitney U-test.

To code the transcriptions, we used deductive category application, resembling the data-driven content analysis technique by Mayring (2000). We read the transcribed material, checking whether a concept covered a transcribed action (e.g., a participant opens file fileName, meaning that (s)he is looking for context). We grouped actions covered by the same concept (e.g., a participant opens three files, but always for context purposes) and continued until we built a pattern that led to a specific outcome (i.e., addressing a defect or a false positive). We split the patterns according to their concept ordering, so that those that led to more defects found or to false positive issues were visible.

We did not apply corrections for multiple comparisons (e.g., Bonferroni), as these are considered "at best, unnecessary and, at worst, deleterious to sound statistical inference" (Perneger, 1998).

Participants might have guessed the purpose of the experiment during the review phase. Therefore, we could not control their behavior based on guesses that either positively or negatively affected the outcome.

Finally, we acknowledge threats to construct validity in the design of the questionnaires used for RQ3, even though they were designed using standard approaches and scales (Oppenheim, 2000).

External validity Threats to external validity for this experiment concern the selection of participants for the experimentation phase. Volunteers selected with convenience sampling could have an impact on the generalizability of results, which we tried to mitigate by sampling multiple roles for the task. If the group is very heterogeneous, there is a risk that the variation due to individual differences is larger than that due to the treatment (Cook and Campbell, 1979).

Furthermore, we acknowledge and discuss the possible threat regarding the selection of the system for the experimental phase. Naturally, the system used is not fully representative of a real-world scenario.

Subjects reviewing the untangled changes (treatment) have to deal with less code concerning a single concept, rather than having to extrapolate context information from a tangled change. At the same time, the treatment group takes longer (in median) to address the second defect. We believe that this is due to the presence of two pull requests and, as such, to the overhead effect of the context switch. From the screencast recordings we found no reviewer using a multi-screen setup; therefore, subjects had to close one pull request and then review the next, where they needed to gain knowledge on different code changes.

Result 2: Our experiment was not able to provide evidence for a difference in net review time between untangled pull requests (treatment) and the tangled one (control); this despite the additional overhead of dealing with two separate pull requests in the treatment group.

For our third research question, we seek to measure whether subjects are affected with respect to the dependent variable of change understanding. Table 4 lists the questions and Figure 2 reports the results. Higher scores for Q1, Q2, and Q4 mean better understanding, whereas for Q3 a lower score signifies a correct understanding.

As for the previous research questions, we test our hypothesis with a non-parametric statistical test. Given the result, we cannot reject the null hypothesis H0u, i.e., we find no evidence that tangled pull requests reduce change understanding. Participants are in fact able to answer the questions correctly, independent of their experimental group.
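To illustrate the kind of non-parametric comparison applied to these Likert-scale answers, the following is a minimal sketch using the Mann-Whitney U-test on two invented sets of scores for a single questionnaire item; the data and variable names are purely illustrative and are not taken from our experiment.

# Illustrative sketch (not our analysis script): comparing Likert answers (1-5)
# for one questionnaire item between the control group (tangled pull request)
# and the treatment group (untangled pull requests).
from statistics import mean, median, stdev
from scipy.stats import mannwhitneyu

control_scores = [4, 4, 5, 3, 4, 5, 4, 3, 4, 5, 4, 4, 3, 5]    # invented data
treatment_scores = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 3, 4, 5, 4]  # invented data

# Descriptive statistics reported per group (mean, median, standard deviation).
for name, scores in (("control", control_scores), ("treatment", treatment_scores)):
    print(f"{name}: mean={mean(scores):.2f}, median={median(scores)}, sd={stdev(scores):.2f}")

# Two-sided Mann-Whitney U-test: a p-value above the chosen alpha (e.g., 0.05)
# means the null hypothesis of no difference between groups cannot be rejected.
u_statistic, p_value = mannwhitneyu(control_scores, treatment_scores, alternative="two-sided")
print(f"U = {u_statistic}, p = {p_value:.3f}")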

After the review, participants filled in a final survey (Q5 to Q12 in Table 4). The results shown in Figure 2 indicate that subjects in the two groups judge the changeset equally (Q5), found no difficulty in understanding the changeset (Q6), and agree on having understood the rationale behind the changeset (Q7). These results show that our experiment cannot provide evidence of differences in change understanding between the two groups.

Participants did not find the changeset hard to navigate (Q9) and believe that the changeset was comprehensible (Q11). The answers differ with statistical significance for Q8, Q10, and Q12.

Result 3: Our experiment was not able to provide evidence of a difference in understanding the rationale of the changeset between the experimental groups. Subjects reviewing the untangled pull requests (treatment) recognize the benefits of untangled pull requests, as they evaluate the changeset as being (1) better divided according to a logical separation of concerns, (2) better structured, and (3) not spanning too many features.

Table 4 (Q5 to Q12). Final survey statements, answered on a Likert scale; * marks statements whose answers differ between groups with statistical significance:
Q5 The changeset was functionally correct
Q6 I found no difficulty in understanding the changeset
Q7 The rationale of this changeset was perfectly clear
Q8 * The changeset [showed] a logical separation of concerns
Q9 Navigating the changeset was hard
Q10 * The relations among the changes were well structured
Q11 The changeset was comprehensible
Q12 * Code changes were spanning too many features

Table 5. Concepts from the literature and the keyword each was mapped to:
- What is the rationale behind this code change? (Tao et al., 2012) -> Rationale
- Is this change correct? Does it work as expected? (Tao et al., 2012) -> Correctness
- Who references the changed classes/methods/fields? (Tao et al., 2012) -> Context
- How does the caller method adapt to the change of its callees? (Tao et al., 2012) -> Caller/Callee
- Is there a precedent or exemplar for this? (Sillito et al., 2006) -> Similar/Precedent

RQ4. Tangled vs. untangled review patterns
For our last research question, we seek to identify differences in patterns and features during review, and their association with the quantitative results. We derived such patterns from Tao et al. (2012) and Sillito et al. (2006), as mapped in Table 5.

Participants were not core developers of the considered software system; it is possible that core developers would be more surprised by tangled changes, find them more convoluted or less "natural," and thus reject them (Hellendoorn et al., 2015). We did not investigate these scenarios further, but studies can be designed and carried out to determine whether and how these aspects influence the results of the code review effort.

Moreover, similarly to us, Tao and Kim did not find a difference with respect to time to completion in their preliminary user study (Tao and Kim, 2015). Further studies should be designed to replicate our experiment and, if the results are confirmed, to derive a theory on why there is no reduction in review time.

Our initial hypothesis on why time does not decrease with untangled code changes is that reviewers of untangled changes (treatment) may be more willing to build a more appropriate context for the change.

This behavior seems to be backed up by our qualitative analysis (Section 6), through the context-seeking actions that we witnessed for the treatment group. If our hypothesis is not refuted by further research, this could indicate that untangled changes may lead to a more thorough low-level understanding of the codebase. Although we did not measure this in the current study, it may explain the lower number of wrongly reported issues we observed in the treatment group.

Our experiment showed no negative effects when changes are presented as separate, untangled changesets, despite the fact that reviewers have to deal with two pull requests instead of one, with the consequent added overhead and a more prominent context switch. With untangled changesets, our experiment highlighted an increased number of suggested improvements, more context-seeking actions (which, it is reasonable to assume, increase the knowledge transfer created by the review), and a lower number of wrongly reported issues.

For the aforementioned reasons, we support the recommendation that change authors prepare self-contained, untangled changesets when they need a review. In fact, untangled changesets are not detrimental to code review (despite the overhead of having more pull requests to review), and we found evidence of positive effects. We expect the untangling of code changes to require minimal cognitive effort and time from the author. This practice, in fact, is supported by the answers to questionnaire items Q8, Q10, and Q12.

SUMMARY
The goal of the study presented in this paper is to investigate the effects of change decomposition on modern code review (Cohen, 2010), particularly in the context of the pull-based development model (Gousios et al., 2014). We involved 28 subjects, who performed a review of pull request(s) pertaining to (1) a refactoring and (2) the addition of a new feature in a Java system. The control group received a single pull request with both changes tangled together, while the treatment group received two pull requests (one per type of change). We compared the control and treatment groups in terms of effectiveness (number of defects found), number of false positives (wrongly reported issues), number of suggested improvements, time to complete the review(s), and level of understanding of the rationale of the change. Our investigation also involved a qualitative analysis of the reviews performed by the subjects involved in our study.