All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
After several rounds of revision, all reviewers agree that the authors addressed all comments, and the article is ready for publication. Thanks for your effort in improving the paper!
No Comment
No Comment
No Comment
One of the reviewers is still asking for a statement summarizing the results of the evaluation. Please take a look at the revision and resubmit your paper.
no comment
no comment
no comment
I understand that the authors addressed all of my previous comments, and it is ready for publication.
With respect to the results, I think my previous comments may have misled. I care about what the evaluation **shows**, not specifically what is contained in the supplemental files. The paper states "we evaluated CAESAR with the REPRODUCE-ME ontology using competency questions collected from different scientists in our requirement analysis phase"; "The domain experts reviewed, manually compared, and evaluated the correctness of the results from the queries using the Dashboard"; and "The domain experts manually compared the results of SPARQL queries using Dashboard and ProvTrack and evaluated their correctness." I see the text regarding issues with null results and modifications, but I don't see any statement summarizing the results of the evaluation. Did the domain experts find that after the modifications, all the competency questions were satisfactorily answered? If so, please state that. If not, it seems like having details about what was problematic would be helpful. (If I'm missing how the competency evaluation works, perhaps clarifying that in the text would also help. Evaluation to me implies there are results that give an indication of how well something works.)
"We used an example Jupyter Notebook which uses a face recognition example applying eigenface algorithm and SVM using scikit-learn" is still not quite right to my ears. Perhaps "... example where eigenface and SVM algorithms from scikit-learn are applied."
No comment
No comment
The reviewers agree that the paper improved a lot, but they still request that a few minor issues be addressed.
As a minor issue, the main research "question" in line 466 is not phrased as a question.
The paper improved in reporting the experiment design, but the wording of the experiment section can still be improved:
The paper divides the main question into 3 parts in the first paragraph of the Evaluation, but it does not reference these parts explicitly. For instance, the paper states that in the first part, they "address the question of capturing and representing the complete path of a scientific experiment" and for that, they evaluate:
- the role of CAESAR in capturing the non-computational data
- the role of ProvBook in capturing the computational data
- the role of the REPRODUCE-ME ontology in semantically representing the complete path ... using the competency question-based evaluation
Then, the next subsection is called "Competency question-based evaluation", indicating that it only covers the 3rd item (the role of the REPRODUCE-ME ontology).
Moreover, by reading this subsection, it seems that it in fact does not cover the role of ProvBook in capturing the computational data (which is somewhat discussed in the next subsection), but it does have a discussion about the role of CAESAR in capturing the non-computational data.
no comment
The text is much improved, and the reduction in passive voice makes the text easier to read.
I would recommend a better inline summary of results in the evaluation section. I'm not sure why the reader is expected to go to a different file to read about them. I realize that having the open, raw results available is great for those that want to poke around, but I think a summary for readers is useful.
- L563: "The competency questions, the RDF data used for the evaluation, the SPARQL queries, and their results are publicly available (Samuel, 2021)." (What are the results?)
- L604, "The files in the Supplementary information provides the information on this evaluation by showing the difference in the execution time of the same cell in a notebook in different execution environments." (What do they show?)
- L630 "The results of the evaluation are available in the Supplementary file." (What are the results?)
The user-based evaluation subsection does have results in the manuscript.
Typo in L581: "uses face recognition example" -> "uses a facial recognition example"
The additional information about the use case in the "Data and user-based evaluation of ProvBook" section (L581) is helpful, but I would argue this is more of an example and probably should be tagged as such. We don't know, for example, that User 2 needed to look at User 1's provenance to fix the error. We also don't know if User 2 is a more experienced Python programmer, or even if User 2's environment (Fedora) made this easier. Because n=1, I don't think we can conclude that ProvBook is the reason for improved performance, but this use case is instructive in understanding why we should expect ProvBook to help.
I think the text reads ok here now.
Reviewer #3 pointed out some questions about the experimental design of the article. I suggest that authors consider all reviewers' comments, especially the comments of reviewer #3.
Reviewer 2 has suggested specific references. You may add them if you believe they are especially relevant. However, I do not expect you to include these citations, and if you do not include them, this will not influence my decision.
[# PeerJ Staff Note: Please ensure that all review comments are addressed in a rebuttal letter and any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate. It is a common mistake to address reviewer questions in the rebuttal letter but not in the revised manuscript. If a reviewer raised a question then your readers will probably have the same question so you should ensure that the manuscript can stand alone without the rebuttal letter. Directions on how to prepare a rebuttal letter can be found at: https://peerj.com/benefits/academic-rebuttal-letters/ #]
no comment
The research question ("it is possible to capture, represent, manage and visualize a complete path taken by a scientist in an experiment including the computation and non-computation steps to derive a path towards experimental results") is too broad, and the evaluation does not seem to be completely related to it.
The evaluation had three parts, which considered distinct aspects:
- Competency question-based evaluation: the competency questions evaluate the possibility of capturing and representing the complete path. Arguably, the system also had to manage it, but there is no visualization evaluation in this part.
- Data-based evaluation: this evaluation was related to reproducibility, performance, and scalability, which are not part of the main research question.
- User-based evaluation: this evaluation was related to representing and visualizing the complete path to users for a usability study.
The paper could break the main research question into smaller questions for each of these parts and discuss the results and implications separately.
no comment
The paper evolved well, and the authors addressed most of my comments appropriately. However, the paper still has a weakness with the experimental design.
The paper addresses the concerns I had.
One minor addition might be a reference to a survey on provenance, such as
Wellington Oliveira, Daniel De Oliveira, and Vanessa Braganholo. 2018. Provenance Analytics for Workflow-Based Computational Experiments: A Survey. ACM Comput. Surv. 51, 3, Article 53 (July 2018), 25 pages. https://doi.org/10.1145/3184900
or
Herschel, M., Diestelkämper, R. & Ben Lahmar, H. A survey on provenance: What for? What form? What from?. The VLDB Journal 26, 881–906 (2017). https://doi.org/10.1007/s00778-017-0486-1
no comment
no comment
The text is improved, and much of the passive voice has been addressed, but there are still a few places where the text could be improved. Specifically, in the evaluation:
- "The evaluation was done" Who did this evaluation?
- "Several runs were attempted to solve the issue". Who attempted these runs? User 1?
- "This was resolved by installing the sci-kit-learn module." Was this User 2, or did ProvBook do this?
Also a few other spots in the text:
- "We, therefore, argue": I would actually prefer this without commas, or with the order swapped to "Therefore, we argue"
- "we aim to create a conceptual model" -> "we create a conceptual model"
- "We use D3 JavaScript library" -> "the"
- "obtained 3 responses" => "received three responses" (got doesn't always translate to obtained)
I didn't see Brank et al. define competency questions in the survey paper, so I was still a bit lost about how "produce good results" (lines 560-561) is actually evaluated. From reading further and from the 2018 paper on "The Story of an Experiment", it sounds like domain experts reviewed the results from the queries to verify them. There are some details about some queries not working and requiring OPTIONAL, but should we assume that the domain experts were finally happy with all competency queries? In any case, moving this domain expert verification statement up earlier would help clarify who is doing the evaluation.
I still don't understand the "Data and user-based evaluation of ProvBook in CAESAR" section. The general factors seem reasonable to care about, but I don't understand how this was *evaluated*. Also, the passive voice (see Basic Reporting) makes this harder to understand. Who did the evaluation? How was this evaluated? The description of how notebooks were used and updated is interesting, but the focus should be on ProvBook. The one sentence here is "Using ProvBook, Users 1 and 2 could track the changes and compare the original script with the new one". How did this help them during the study? Was one user able to use ProvBook and the other not able to use it? That type of study would allow conclusions that ProvBook improved reproducibility because, for example, User 2 took less time to reproduce the original notebook.
In the User-based evaluation of CAESAR, I also was left wondering how the "None of the questions were mandatory" piece is addressed. Were the participants required to use the tool for a certain amount of time? Are the "questions" the competency questions or the survey questions? If the survey questions were not required to be answered, why? If the competency questions were not required, how did you verify whether the users used the tool for its intended purpose?
Unfortunately, I still feel that the conclusions are rather limited, and it is unclear how they follow from the evaluations as described. This is likely due to experimental design notes above, but the discussion section still states conclusions that are not in evidence: "The results of the data and user-based evaluation of ProvBook in CAESAR show how it helps in supporting computational reproducibility." I suspect that ProvBook in CAESAR does help support computational reproducibility, but I don't see how what is described in that evaluation subsection validates that.
Overall, all reviewers agree that the research sounds good. However, they think that the related work and the experimental design and reporting need more elaboration. One of the reviewers asked that the contributions of this article be stated clearly. All reviewers also suggested some reorganization of the paper. Please consider these comments in your new version.
[# PeerJ Staff Note: Please ensure that all review and editorial comments are addressed in a response letter and any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate. It is a common mistake to address reviewer questions in the response letter but not in the revised manuscript. If a reviewer raised a question then your readers will probably have the same question so you should ensure that the manuscript can stand alone without the response letter. Directions on how to prepare a response letter can be found at: https://peerj.com/benefits/academic-rebuttal-letters/ #]
Strengths:
S1. The paper shows the context (scientific experiments) and motivates the need for collecting computational and non-computational provenance from the early stages of the experiments to support their reproducibility.
S2. Figures are relevant, high quality, well-labeled, and described. The architecture figure provides a good overview of the approach that is helpful for following the description. The other three figures provide examples of analyses that CAESAR supports.
Weaknesses:
W1. The related work section seems outdated. The paper states "Only a few research works have attempted to track provenance from computational notebooks (Hoekstra 2014; Pimentel et al., 2015; Carvalho et al., 2017)", but there are newer approaches that also track provenance from notebooks:
- Koop, David, and Jay Patel. "Dataflow notebooks: encoding and tracking dependencies of cells." 9th {USENIX} Workshop on the Theory and Practice of Provenance (TaPP 2017). 2017.
- Kery, Mary Beth, and Brad A. Myers. "Interactions for untangling messy history in a computational notebook." 2018 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, 2018.
- Petricek, Tomas, James Geddes, and Charles Sutton. "Wrattler: Reproducible, live and polyglot notebooks." 10th {USENIX} Workshop on the Theory and Practice of Provenance (TaPP 2018). 2018.
- Head, Andrew, et al. "Managing messes in computational notebooks." Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 2019.
- Wenskovitch, John, et al. "Albireo: An Interactive Tool for Visually Summarizing Computational Notebook Structure." 2019 IEEE Visualization in Data Science (VDS). IEEE, 2019.
- Wang, Jiawei, et al. "Assessing and Restoring Reproducibility of Jupyter Notebooks." 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2020.
W2. Some older approaches are also missing in the related work: approaches that collect provenance at the OS level usually are able to interlink data, steps, and results from computational and non-computational processes, once the non-computational processes are stored in the computer. However, more notably, Burrito collects the OS provenance and provides a GUI for documenting the non-computational processes and annotating the provenance:
- Guo, Philip J., and Margo I. Seltzer. "Burrito: Wrapping your lab notebook in computational infrastructure." (2012).
W3. The "Design and Development" subsection of "Background" is confusing. On one hand, it is dense and enters into implementation details of the approach that are later explained in the "Results" section. On the other, it is too shallow and does not explain important concepts and tools used by the paper. For a better comprehension of the paper, the background section could answer the following questions:
- What are the main features of REPRODUCE-ME? How does it compare to other provenance ontologies such as PROV and P-Plan?
- What is OBDA? What are the benefits of using it?
W4. The structure of the paper does not conform to the PeerJ standards (https://peerj.com/about/author-instructions/#standard-sections): it has a "Background" section instead of a "Materials & Methods" section. Additionally, a big part of the "Results" section describes the approach in detail instead of describing the experimental results. Both the "Background" and the approach proposal could be in the "Materials & Methods" section and solve W3 partially.
W5. While the raw data was supplied (great!), the cited page (Samuel, 2019b) is generic for multiple research projects, and finding the raw data associated with this specific paper requires navigating through some links to reach the GitHub repository with the data (https://github.com/Sheeba-Samuel/CAESAREvaluation). This repository could be cited directly. Additionally, there is no description on the repository describing the structure and the data, making it hard to validate.
Strength:
S3. The supplemental files are well described and allow the replication of the usability experiment.
Weaknesses:
W6. The originality of the paper evaluation is not clear. It seems that CAESAR was introduced and partially evaluated in (Samuel et al., 2018), where the first half of the evaluation occurred (the definition and evaluation of the competency questions). However, the usability evaluation seems original. The paper should state clearly what is original in this paper and how it improves on (Samuel et al., 2018).
W7. The paper does not report the results of the competency query evaluation. It indicates that each question addressed different elements of the REPRODUCE-ME Data Model, but it does not present the questions nor the results. (Disregard this weakness if the evaluation is really part of another paper. In this case, reinforce it in line 634).
W8. The research questions are not well defined. While the main motivation of the paper is based on reproducibility, the current evaluations seem to assess understandability and usability, instead.
Weaknesses:
W9. The paper should indicate the threats to the validity of the experiments. Given the size of the population, it is likely that the experiment has an external threat to validity with statistical results that are not sound. Additionally, the usage of CAESAR did not occur in a controlled environment, which also leads to a threat to internal validity.
W10. The conclusion claims that the approach addresses understandability, reproducibility, and reuse of scientific experiments, but the experiments do not support these claims.
The report is mostly there, but I think it could have some additional support.
The paper states: "However, this is too little too late: These measures are usually taken at the point in time when papers are being published". Give some examples of the measures that are insufficient.
In line 117: "To the best of our knowledge, no work has been done to track the provenance of results generated from the execution of these notebooks and make available this provenance information in an interoperable way". I don't think the claim of novelty needs to be this bold. For example, I think NBSafety (http://www.vldb.org/pvldb/vol14/p1093-macke.pdf) and Vizier (https://vizierdb.info) are both related work in the same area as the reported tool. The important thing here is that PeerJ is not about novelty but rather about whether the work is sound. This particular work has its own perspective and is useful, but I think it is unnecessary to make these claims of research novelty. At least provide some deeper justification.
Lastly, when introducing the FAIR principles, say what they stand for.
The experimental design and reporting need to be better explained.
In the section Background Evaluation, you seem to describe a number of different evaluation strategies. I had trouble figuring out the various strategies and telling them apart. It would be beneficial to distinguish each of the evaluation strategies and make it clear what approach you used in each of them. Maybe simply labeling them would help. For example, you seem to have an application evaluation, a competency question-based evaluation, and a user-based evaluation?
The lack of organization of the material is also present in the presentation of the evaluation results. Please clearly describe and distinguish which evaluation strategies were used and what the specific evaluation results were. For example, it was unclear why a notebook with face recognition (line 574) was being used for evaluation. Likewise, I'm not sure if a user survey study of 6 participants is enough when the other evaluation approaches are not clearly demarcated.
Be careful of claims made in the system description without evidence. For example "The provenance information is stored in CAESAR to query this data from different sources efficiently." These claims when made need to be substantiated.
I believe the majority of the findings are valid given that this is a report of a software platform, but given the reporting on the method, it was hard to determine whether the evaluation results in their totality showed what the authors said they show.
In general, a nice contribution of the system, but I think the reporting around the evaluation needs to be clearer for the reader. Also, I don't think the claim of novelty needs to be made; instead, the focus should be on the contribution of the system and validating the claims of the system.
The paper's overall structure is fine, but the writing should be improved. There is a lot of passive voice used that should be changed. For example, "it is also required to represent and express this information in an..." could be restated as "information should be represented and expressed in an..." Also, there are places where "got" is used that could use more precise language (e.g. obtained). I also would advise against using quotation marks for defining terms (e.g. "provenance") as that brings a connotation that is likely not intended; use italics or bold instead. In general, the paper is wordy, and sentences or phrases can be shortened or even omitted. In addition, the results of the evaluation seem to be discussed twice (once in the Evaluation subsection [p. 11] and again in the Discussion section [p. 14]). The paper is professional, the raw survey results are shared, and the source code is available. There are some places where terms could be defined earlier (e.g. FAIR in the introduction is not explained, JupyterHub [p. 8]).
The paper is a research paper, detailing a framework and the evaluation of it. The framework, CAESAR, seeks to help users create and maintain reproducible experimental workflows. The introduction lays out the reasons why reproducibility is important and how provenance can help, and the contributions of CAESAR, ProvBook, and REPRODUCE-ME. Much of the paper provides details on the architectures of these frameworks which is expected.
To me, the paper does not do well enough describing the experimental design. The paper states that competency questions were used to evaluate CAESAR, then discusses data and user-based evaluation of ProvBook, and finally discusses a user-based evaluation. It is not clear to me what the competency questions are nor what "plugged the ontology in CAESAR by using these competency questions" means. I was unsure if the paragraphs describing evaluation were related or separate experiments. It is also unclear what the user-based evaluation measured. Users had to upload data, but nothing else was mandatory? The number of users (6) is too low to make significant statistical conclusions, and the evaluation seems entirely subjective.
I don't think the conclusions are overstated but rather the conclusions seem quite limited. Due to some issues with the experimental design (see 2), I think it is difficult to judge the conclusions. It is important for users to "like" a system, but it would be helpful to know more about the use cases where the framework has been used.
This paper describes a lot of work in the areas of reproducibility and provenance which are important areas for computational science. The paper describes how domain scientists in biological imaging face particular challenges and how CAESAR, ProvBook, and REPRODUCE-ME were designed to help them. It seems like there has been a lot of work done which is important to both the bioimaging community and the broader community interested in reproducibility and provenance.
As a paper describing these frameworks, it goes into great detail about the particular implementation, which is important documentation, but is not particularly useful for readers who do not use this specific system. There is certainly a tension between not providing enough detail on how the system works and writing user documentation in a research paper, but I feel the paper has too much of the latter. The long lists describing panels and the REPRODUCE-ME schema should be moved to supplemental material.
The threads related to the core problems of supporting reproducibility in the bioimaging workflows are sometimes lost in the details of the system. I would strongly encourage adding a running example to the text to help readers understand specific use cases and how the frameworks help users to address them. In this context, specific features can be detailed. This should also lead to greater coherence among the three pieces that are detailed in the paper.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.