Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on November 10th, 2020 and was peer-reviewed by 3 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on December 4th, 2020.
  • The first revision was submitted on January 5th, 2021 and was reviewed by 3 reviewers and the Academic Editor.
  • A further revision was submitted on January 29th, 2021 and was reviewed by the Academic Editor.
  • The article was Accepted by the Academic Editor on February 1st, 2021.

Version 0.3 (accepted)

· Feb 1, 2021 · Academic Editor

Accept

The manuscript has been revised and has been accepted.

Version 0.2

· Jan 22, 2021 · Academic Editor

Minor Revisions

I am glad to inform you that pending MINOR revisions your paper will be accepted. The reviewers are satisfied with the previous revision and pinpointed only a few issues that should be addressed in this revision.

·

Basic reporting

The authors significantly improved the paper in accordance with my (and the other reviewers') comments. As I see no issues left unaddressed, I have no further objections.

Experimental design

The authors included the statistical analysis, as per my suggestion, and addressed all of the raised issues.

Validity of the findings

The additional explanations raise the quality of the newest version of the paper; thus, the validity of the findings is no longer in question.

Additional comments

The authors significantly improved the paper in accordance with my (and the other reviewers') comments. As I see no issues left unaddressed, I have no further objections.

Reviewer 2 ·

Basic reporting

• Line 127 - MetricsReloaded seems to be doing fine as an IntelliJ plugin; its last release was in December 2020 (https://plugins.jetbrains.com/plugin/93-metricsreloaded)
• In Section 3.1, each RQ could be detailed immediately after it is stated; this might further help readers maintain the current context and better understand the scope of the research. However, this is just a suggestion and is deferred to the authors' judgement.
• Line 218 - "speed of programming languages" might not best express the authors' intention; what is actually measured is how quickly code written in a given language can be executed after being translated/interpreted by the compiler/interpreter for the target architecture, as a matter of optimization.
• Line 239 - capitalize 'Figure'
• Lines 249 and 266 look a bit odd, as they start with a lowercase letter, and it is unclear whether this is intentional or the result of a formatting error.
• Line 272 should mention that Listing 1 is in the annex, as it is many pages away. Also, the listings might be put into a table with 2 columns and only take up 2 pages for brevity, or included in a data replication package on figshare/zenodo with its own DOI.
• Line 493 - 'Whe' (typo)

Experimental design

no comment

Validity of the findings

no comment

Additional comments

The paper was revised according to the findings and suggestions resulting from the first review. The current version provides much more detail regarding the paper's main objective, the means employed to fulfil the stated objective as well as the methodology and threats to validity. Observations resulting from a careful reading can be found in the section 'Basic Reporting'.

Reviewer 3 ·

Basic reporting

I thank the authors for incorporating the suggested changes in the manuscript.

1. Minor grammatical mistakes remain in the manuscript. Please check the use of commas, articles, and prepositions. I have listed only a few examples for your reference; kindly cross-check the complete manuscript for correct grammar.

Line 48: "metrics under the categories of Size, Coupling, Complexity and Inheritance [7]." should be "metrics under the categories of Size, Coupling, Complexity, and Inheritance [7]."


Line 88: "ownership model to guarantee memory-safety and thread-safety; productivity, with integrated package" should be "ownership model to guarantee memory-safety and thread-safety; productivity, with an integrated package".


Line 109: "review, it is found that the following set of open-source tools is able to cover most of quality metrics" should be "review, it is found that the following set of open-source tools can cover most of quality metrics".

Line 144: "interpreting the results from developers and researchers standpoint." I think you missed adding apostrophes to "developers'" and "researchers'".

Line 171: "lines or comments; the count, however, depends on the physical format of the statements and on programming". Omit "on" before "programming"; its use is repetitive.

Line 223: "Table 7 lists the code artifacts used (sorted out alphabetically) and provides a brief description for each" should be "Table 7 lists the code artifacts used (sorted out alphabetically) and provides a brief description of each"


2. Multiple references can be grouped together.
For example:
Line 52: "code metrics to predict or infer the maintainability of a project [9], [10], [11]." may be rewritten as "code metrics to predict or infer the maintainability of a project [9-11]."

3. I also suggested earlier to go through all the references. Kindly comply with the PeerJ reference format. Below is only one example; you need to recheck all references.

Line 855 [17] Abhiram Balasubramanian, Marek S Baranowski, Anton Burtsev, Aurojit Panda, Zvonimir Rakamarić, and Leonid Ryzhyk. System programming in rust: Beyond safety. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems, pages 156–161, 2017.
Line 858 [18] Vytautas Astrauskas, Peter Müller, Federico Poli, and Alexander J Summers. Leveraging rust types for modular specification and verification. Proceedings of the ACM on Programming Languages, 3(OOPSLA):1–30, 2019.

Experimental design

The authors have justified the work by revising the manuscript.

1. Statistical tests are now included. My concerns are the following:
a) Please state the hypotheses set for conducting the statistical tests.
b) If the Wilcoxon test is used to compare multiple pairs, you need to apply a p-value correction, such as a Bonferroni correction, for the results to be reliable.
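For illustration only, a minimal sketch of such a correction follows (this is not the authors' code; the language pairs and per-algorithm metric values are hypothetical placeholders):

from scipy.stats import wilcoxon

# Hypothetical per-algorithm metric values for two language pairs (placeholders).
pairs = {
    ("Rust", "C"): ([12, 15, 9, 20, 7, 11, 18, 14, 10],
                    [10, 14, 8, 19, 9, 10, 17, 12, 11]),
    ("Rust", "Python"): ([12, 15, 9, 20, 7, 11, 18, 14, 10],
                         [8, 13, 7, 15, 6, 9, 14, 11, 9]),
}

m = len(pairs)  # number of pairwise comparisons
for (a, b), (x, y) in pairs.items():
    p = wilcoxon(x, y).pvalue   # paired Wilcoxon signed-rank test
    p_corr = min(1.0, p * m)    # Bonferroni: multiply each raw p-value by m
    print(f"{a} vs {b}: raw p = {p:.4f}, Bonferroni-corrected p = {p_corr:.4f}")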

Validity of the findings

The work is novel, and the authors have done a good job. The conclusion and future work are now in better shape. Supporting data are provided, and the conclusions are linked to the original research questions.

Kindly apply the Bonferroni correction to the Wilcoxon tests for the reliability of the results, as I stated under Experimental design.

Version 0.1 (original submission)

· Dec 4, 2020 · Academic Editor

Major Revisions

The three reviewers agree that the paper has some merits but they all raised relevant problems with the experimental design, namely the selection of the target applications and the lack of statistical analysis of the results. All reviewers provide several suggestions and remarks that should be taken into account in the revised version of the manuscript.

·

Basic reporting

The structure of the paper is appropriate, and the paper is easy to read and follow. The English language used in the paper is adequate, with only minor spelling mistakes, which can be mitigated with another round of proofreading by the authors. The main contributions are appropriately highlighted in the introduction, making it easy for readers to quickly gain an overview of the paper.

-Please provide the reference for the statement in line 79 "...for example, software like Firefox, Dropbox, and Cloudflare use Rust."

-The manuscript is left-justified, not per the manuscript preparation instructions: "Left justify all text to the left margin. Do not 'full width' justify."

-The analysis for RQ1 should be supplemented with a reference showing that the various lines-of-code metrics are the most appropriate metrics for programming language verbosity.

-On a similar note, what about analyzing classes for RQ2? Why limit yourself only to methods?

-For the pipeline of the evaluation framework (Figure 1), please describe how the .json results differ after each step. It would be helpful to show an example of each JSON.

-It is described that compare.py is the main entry point for the source code files and produces the final .json with the results. This is not evident from the pipeline in Figure 1 - here, compare.py is only one part of the whole pipeline, not the main script that runs the other ones.

-How does the tool rust-code-analysis analyze C, C++, JavaScript, and other code? Is it not only meant for Rust code, as its name suggests? A quick review of the source code indicates that it analyzes other languages as well; please emphasize this ability explicitly so that readers are not confused by its name.

-Be consistent with naming JSON (sometimes it appears as Json).

Experimental design

The experiment's design is appropriate, where various software metrics of the same software methods implemented in different languages are compared among themselves. The data used in the experiment and the source code used in the analysis are both publicly available. This makes the experiment repeatable.

-The main issue with the experiment is the lack of any statistical analysis of the results. There is enough data (enough analyzed software files) to make at least a basic statistical comparison. PeerJ CS is a high-impact journal, and thus the methodology should be suitable. One could use repeated-measures ANOVA or Friedman's ANOVA on every metric to find out whether the differences shown in the charts and tables are due to chance (the source of the software or the programmer writing it) or are statistically significant. If there are differences, use post-hoc tests (e.g., Wilcoxon signed-rank with a correction for multiple comparisons such as Holm-Bonferroni) to determine the real answers to the posed research questions; a minimal sketch of such an analysis is given after this list.

-Provide reasoning on why some of the programming languages were used in the experiment and others were not. Yes, it is mentioned that the implemented module only supports some of them, but why those? Are the chosen programming languages valid alternatives in some information systems (WebAssembly programming, etc.)? With this, you will introduce readers to Rust's typical applications and show what its main alternatives are.
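A minimal sketch of the suggested analysis, for illustration only (the data below are random placeholders, not the study's measurements, and the language list is assumed):

from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

languages = ["Rust", "C", "C++", "Python", "JavaScript", "TypeScript"]
rng = np.random.default_rng(42)
# One row per analyzed source file, one column per language, for a single metric.
metric = rng.normal(loc=50, scale=10, size=(9, len(languages)))

# Omnibus test: do the languages differ on this metric at all?
stat, p_omnibus = friedmanchisquare(*(metric[:, i] for i in range(len(languages))))
print(f"Friedman chi2 = {stat:.2f}, p = {p_omnibus:.4f}")

if p_omnibus < 0.05:
    # Post-hoc pairwise Wilcoxon signed-rank tests with Holm-Bonferroni correction.
    idx_pairs = list(combinations(range(len(languages)), 2))
    raw_p = [wilcoxon(metric[:, i], metric[:, j]).pvalue for i, j in idx_pairs]
    reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
    for (i, j), p, r in zip(idx_pairs, p_adj, reject):
        print(f"{languages[i]} vs {languages[j]}: adjusted p = {p:.4f}, significant = {r}")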

Validity of the findings

The main issue with the findings was already mentioned in the part about the experimental design - the lack of any in-depth analysis of the results. Additionally, several other issues have to be addressed.

-The discussion of the differences in LLOC is a bit too narrow. The authors only explain the differences in this metric as the product of more types of logical statements being available in Rust. In that case, Python would have the lowest LLOC count, which is not evident from the results. Also, the sheer number of available statement types probably does not correlate with their higher usage. If you argue that it does, please provide a reference or at least a viable justification. I would argue that there could be other, fundamentally simple reasons for a higher LLOC. For example, it could indicate that the logical statements are more elementary (do less in one call) than in other languages, so more of them are needed. Does this make the code more verbose? Probably, but it is open for discussion. What if this is the key reason why the cyclomatic complexity and cognitive complexity are the highest for Rust?

-On a similar note, the discrepancies in the number of methods and the sum of arguments could probably be further explained by the lack of default values for method arguments in Rust.

-Also, is a higher count of methods a sign of better code structure? The answer to RQ2 suggests this. Again, this implies that splitting functionality across multiple method variants (for different argument counts), rather than using one method, is superior. Please elaborate on this point.

Additional comments

The paper presents the results of an analysis of maintainability metrics and other software metrics for software written in the Rust programming language. The authors took a publicly available repository of different procedures written in various programming languages and compared them. The paper's main findings are favorable to the Rust programming language, as Rust comes out as a language in which software is written without too much complexity, is verbose, and is not too hard to maintain. The paper is derived from a final thesis which (after my review and search) has never been published before.

Reviewer 2 ·

Basic reporting

- In the Introduction section, the maintainability characteristic should be linked to well-known software quality standards such as ISO 9126 and ISO 25010; they also provide the sub-characteristics for maintainability, allowing for a finer-grained approach
- The Introduction section could also use more recent references, as there is a lot of post-2017 research on the topic.
- Rephrase lines 46-47 to eliminate repetition
- Revise reference on line 101
- Line 125: "open-source algorithms" - perhaps this needs a bit of clarification; are they open-source algorithms or are the algorithms implemented in open-source code?
- Table 4 - the meaning of some of the formula terms (column Formula; N1 and N2, for example) remains unclear.
- Perhaps a reorganisation of the Tables on pages 4 - 6 would improve the paper's readability, as currently there is a 1.5 page gap in the article text.
- "NARGS and NEXITS are two software metrics defined by Mozilla and have no equivalent in the literature about maintainability metrics". In that case, what makes the authors employ these metrics for studying the maintainability characteristic?
- Lines 155-156: please recheck
- Figure 1: the first two boxes (the one labeled 'Source code' and the one with file extensions) should be merged, as together they convey the same idea. Also, perhaps it would be better to eliminate the .json boxes and represent the entire process on a single line; perhaps use 'JSON' as an annotation over the arrows to show that this was the selected format for data transfer.
- Lines 199-201 - it's not clear to me what this paragraph refers to; perhaps its intent could be further clarified by the authors
- Listing 1 does not improve the quality or understandability of the article; perhaps it would be best to include it in the repository's GitHub readme and direct the reader to it using a suitable footnote.
- Lines 403-404 and 435-436 refer to the wrong Tables/Figures.
- Lines 437-438 - the temporal characteristic of the MI is not clear; changes in its value could be interpreted as a modification of maintainability, but the metric itself reports a singular value.

Experimental design

- The paper should be structured according to existing best practices regarding case study research (e.g., Runeson and Höst - Guidelines for conducting and reporting case study research in software engineering)
- The Maintainability Index was first elaborated for a number of C systems and has recently come under strong criticism for not being able to adequately express the maintainability characteristic in newer paradigms (such as object-oriented) and newer programming languages. While there is still merit in using it, the authors should address the existence of relevant concerns. In addition, further explanation is required regarding the different forms employed for the MI (a commonly cited reference form is given after this list). This, together with the selection of rather simple metrics to assess maintainability, raises issues regarding the accuracy of the authors' measurements and their validity.
- I am not convinced that RQ1 - RQ3 are related to software maintainability, as it is understood from a software engineering perspective.
- I believe the authors should drill down and present a comparative evaluation at the target application level; do the descriptive statistics presented hold for each application, or are there more interesting findings?
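For reference, a commonly cited three-metric form of the Maintainability Index and its rescaled (0-100) variant are given below; these are the general formulations as usually quoted, and the paper should state explicitly which variant and coefficient set it uses:

MI = 171 - 5.2 \ln(V) - 0.23 \cdot CC - 16.2 \ln(LOC)
MI_{scaled} = \max(0, 100 \cdot MI / 171)

where V is the average Halstead Volume, CC the average Cyclomatic Complexity, and LOC the average lines of code per module.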

Validity of the findings

- The selection of the 9 algorithms is arbitrary, and introduces an important threat to the external validity of the study. In addition, it is usually the case that algorithm implementations are but a small part of most large-scale systems, so it is not at all clear how the maintainability characteristic that was evaluated using these algorithm implementations will scale upwards.
- A further threat is represented by the fact that the studied algorithms were implemented as part of a software suite to study the performance of different programming languages/runtimes. This could have a further effect on the representativeness of these code bases for larger scale applications developed using those languages.
- With regards to RQ1, the authors did not detail the relation between code verbosity and maintainability. Existing methodologies for determining maintainability at a higher level than the MI, such as technical debt, are concerned with existing best practices, the detection of code smells, and other weaknesses; as such, it is unclear how the innate verbosity of a language will translate to the maintainability characteristic.
- Regarding the authors' answer to RQ2, the discussion should be based on the implementation of larger-scale software; it should also include a discussion of the source code authors' programming style, as that can have an impact on these complexity metrics, especially when considering such a limited code base. This is true especially in the case of the NARGS and NEXITS metrics, which are not extensively studied in the literature.
- The application of the Halstead time and bugs metrics to a new programming language/construct introduces further threats to validity; these proposed values (division by 18 and 3000, respectively) should most likely be evaluated empirically first. This is partly addressed by the authors in the Threats to Validity section.
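For clarity, the conventional Halstead estimates referred to above are commonly given as

T = E / 18 \quad (estimated implementation time, in seconds)
B = V / 3000 \quad (estimated number of delivered bugs)

where E is the Halstead Effort and V the Halstead Volume; both constants were calibrated empirically on older procedural code bases, which is why their transfer to a new language should first be validated.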

Additional comments

The paper is competently written and approaches a subject of current interest in research. However, I believe that the title is out of sync with the paper's contents. The selection of target applications is severely limited and suitable for an introductory or position paper on the subject, not a full journal publication. Furthermore, the selection of metrics to assess maintainability is limited to simplistic measurements. Recent research into maintainability generally employs more complex measures such as technical debt or the impact of code smells (as measured, for example, using SonarQube or Ptidej). Of course, their application assumes a larger target code base to provide meaningful results. Relating to the selection of target applications, the 9 implementations are part of a benchmarking suite and as such introduce an important threat related to the validity of the conclusions when these are externalized to other kinds of software (e.g., the open-source world or proprietary implementations of large-scale systems).

I believe that in order to work well, the paper should be re-targeted towards examining source code verbosity and understandability across different languages. In this way, the selection of target applications gains relevance, and well-known metrics such as the Halstead suite (that are no longer used to evaluate maintainability) can be more successfully employed.

Reviewer 3 ·

Basic reporting

The authors have followed the professional article structure and shared the raw data. I commend the authors for their work, but certain issues need to be resolved before acceptance.

1. The mapping of figures/tables should be thoroughly cross-checked against the places where they are referenced in the manuscript. The authors need to correct the table and figure referencing.
For example:
In line 358, "In the table, we report the mean and...": which table are the authors referring to?
Line 403: "The boxplots in Figure 4 and Table 9 report the distributions, mean, and median of the Halstead...". The authors have cited the wrong table and figure references; it should be Figure 5 and Table 10.
A similar observation is made at lines 435-436: "The boxplots in Figure 4 and Table 9 report the distributions, mean, and median of the Maintainability Indexes computed for the six different programming languages." This line is repeated with the wrong references.

2. The authors use "we" too much in the paper; I suggest using "the paper" instead.

3. Different notations are used for the same object. It is recommended that the authors use a single term. Some examples are:
Json, json, .json;
line 321: SLOC; line 322: source loc; lines 323 and 324: lines of code;
COGNITIVE complexity (lines 375, 376), Cognitive complexity (lines 390, 395, 396), cognitive complexity (line 393), or Cognitive Complexity (line 396);
Program difficulty (lines 406, 409) and Difficulty (line 408).

4. As the authors have mentioned in line 438, "Halstead Volume (V), the Cyclomatic Complexity (CC), ...", they must introduce the acronyms for all other terms when first used in the paper.

5. The paper is well organized, but at some points restructuring of sentences is required. A few examples are:

Lines 269-271: Multiple uses of "and" in one sentence. "Concerning the original implementation of the rust-code-analysis tool, we have forked the project and performed modifications on it by adding metrics computations (e.g., the COGNITIVE metric) and changes to the possible output format provided by the tool."

Lines 440-441: "By using all the formulas for the Maintainability Index, we computed for the source files written in Rust an average MI that placed the fourth among all considered programming languages."
"This very low value of the cognitive per method for Rust is related..." should be "This very low value of the cognitive complexity per method for Rust is related..."

6. Minor grammatical errors were noticed. For example,
In line 335, "....with the second-highest, mean being 59 for the....." should be "....with the second-highest mean being 59 for the....."

In the captions of Figure 1 and Figure 2: "Distributions of the metrics about..." should be "Distribution of the metrics about...".
Line 95: "Systematic Literature review" should be "systematic literature review".

7. Please check the PeerJ reference format; the references should be consistent with that format.
The references in the manuscript do not follow a common format. For example:
Alqadi, B. S. and Maletic, J. I. (2020). Slice-based cognitive complexity metrics for defect prediction. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 411–422. IEEE.
Astrauskas, V., Müller, P., Poli, F., and Summers, A. J. (2019). Leveraging rust types for modular specification and verification. Proceedings of the ACM on Programming Languages, 3(OOPSLA):1–30.

Experimental design

The experiments were well implemented, and the results are consistent. The work is novel. A tool is constructed to extract metrics of Rust and object-oriented languages. Metrics are collected for 9 programs written in 6 programming languages. The paper is well written, and the structure makes it easy to follow. The research questions are well formulated.

1. I would request the authors to comment on their selection of metrics to be extracted from the code. Why did they not extract object-oriented metrics?

2. Algorithms are language-independent. The authors have used code in different languages to do a comparative analysis. In the Table 6 title, "algorithms" should be replaced by "code". Similarly, throughout the text, whenever referring to code, replace "algorithm" with "code".

3. The authors mentioned and analyzed the maintainability index in subsection 4.4. I would suggest the authors include some ranges for the maintainability index (for example: bad, average, good, acceptable). This will give readers more clarity about its relevance.

4. In Table 5, the authors have described the three variants of the MI metric. It is suggested to add a reference and a little detail for each definition in the corresponding section.

5. Table 1 shows that CKJM extracts Java and C metrics.
But CKJM collects metrics only for compiled Java classes. CKJM stands for Chidamber and Kemerer Java Metrics; it does not work for C code. The authors need to rectify this.
Spinellis D. Tool writing: a forgotten art? (Software tools). IEEE Software. 2005 Jul 11;22(4):9-11.

Validity of the findings

I appreciate that the authors provide all underlying data supporting the replication of the work.

1. In the Results section, conclusions are well stated for each RQ, but the comparative analysis needs to be further strengthened by using statistical tests. The authors must include statistical validation of their results. Depending on the nature of the data, they can use either parametric or non-parametric tests to statistically validate the results.

2. The Conclusion section needs to be elaborated; the authors should include the main contributions in it.

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.