All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
Thank you for addressing the reviewers' comments.
One of the reviewers provided a second review and suggested minor revisions to your manuscript. Please look at the reviewer's comments and provide a response to them.
Basic reporting is acceptable.
Acceptable as a proof of principle.
Interesting. This is not a statistically grounded study; it is a proof of principle, which makes sense in the context the authors describe.
I feel the article could benefit if the authors considered the following additional points. I don't feel these are actually "minor" points, but they could be handled as minor revisions.
1. "Semantics" is a feature of both natural language of statements in formal logic
Authors consistently use the term "semantic" or "semantics" to describe expressions in formal logic designed to be machine interpretable using semantic web approaches. This may seem a bit parochial to general CS readers coming from outside the semantic web culture. Authors might reflect that "semantics" is a strong feature of both natural and artificial / formal languages and that extremely rapid progress is being made in deep learning NLP approaches. Authors might improve this article by simply ensuring that they qualify "semantics" as "formal machine interpretable semantics"' or similar phrasing. Future progress in filling the "gap" mentioned by the authors may well come from combining semantic web and deep learning approaches.
2. "Conditional probability” vs. Likert scale
The authors provide what they call a "conditional probability" term in their formal nanopublication statements. But I get the distinct impression that this term does not reflect the actual numerical results in an article – how could such a generalization be possible in any event? – rather it seems to represent a subjective likelihood asserted by the authors, i.e. a value in a Likert scale. This would be a simple numerical substitute for the complex linguistic hedging typically found wrapped around the textual claims in any scientific publication. I recommend that this be made clear in the text.
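To make concrete what I mean by a Likert-type scale: as I read it, the qualifier encodes an author-asserted lower bound rather than anything computed from the article's data. A minimal sketch of that reading, in Python; only the "generally = at least 90%" value is taken from the manuscript's own example (lines 71-79), and the other entries are placeholders of mine, not the authors' vocabulary:

```python
# Hypothetical mapping from super-pattern qualifiers to the subjective
# probability thresholds they stand in for. Only "generally" (>= 90%) is
# taken from the manuscript's example; the rest are illustrative placeholders.
QUALIFIER_THRESHOLDS = {
    "always": 1.0,       # placeholder
    "generally": 0.9,    # "in at least 90% of the cases" (authors' example)
    "frequently": 0.7,   # placeholder
    "sometimes": 0.1,    # placeholder
}

def reading(qualifier: str) -> str:
    """Spell out the lower bound a qualifier asserts, Likert-style."""
    threshold = QUALIFIER_THRESHOLDS[qualifier]
    return (f"'{qualifier}' asserts the relation holds in at least "
            f"{threshold:.0%} of cases, as judged by the authors")

print(reading("generally"))
```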
The reviewers have provided long and detailed reviews. They have noted fundamental issues that should be addressed, both to ensure that the manuscript has practical relevance and to ensure that the content is sufficient to describe and justify the approach. One reviewer asked you to cite specific articles. Although it may or may not be helpful to cite those specific articles, this highlights a potential need to be more comprehensive in the works cited.
[# PeerJ Staff Note: It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful #]
The authors describe an effort to publish versions of original papers – or essential claims of original papers – in the nanopublication format, in RDF, using a “super-pattern” version of nanopublications structured as an ordered series of slots: context class, subject class, qualifier, relation, object class. The implied but not rigorously stated claim is that publishing scientific articles natively as nanopublications using this pattern will make it more possible “for researchers in different disciplines to keep up-to-date with the recent findings in their field of study” (from the Abstract).
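For concreteness in the comments below, this is how I read the five slots, written as a plain-Python sketch; the field names and the rendering are mine, not the authors' schema or ontology terms, and the example values come from the manuscript's own informal reading of Felix and Barrand 2002 (lines 71-79):

```python
from dataclasses import dataclass

# My shorthand for the five super-pattern slots as I understand them;
# the field names are mine, not the authors' schema or ontology terms.
@dataclass
class SuperPatternClaim:
    context_class: str  # "in the context of an instance of ..."
    subject_class: str  # "whenever there is an instance of ..."
    qualifier: str      # e.g. "generally (in at least 90% of the cases)"
    relation: str       # e.g. "affects"
    object_class: str   # "... an instance of ..."

# The manuscript's informal reading of Felix and Barrand 2002 (lines 71-79):
claim = SuperPatternClaim(
    context_class="rat brain endothelial cell",
    subject_class="transient oxidative stress",
    qualifier="generally (in at least 90% of the cases)",
    relation="affects",
    object_class="Pgp expression",
)
print(f"In the context of an instance of {claim.context_class}, "
      f"{claim.qualifier}, an instance of {claim.subject_class} "
      f"{claim.relation} an instance of {claim.object_class}.")
```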
The language is clear, unambiguous, and professional, with no obvious grammar errors, and the structure is appropriate.
The authors are admirably persistent in continuing to work through the nanopublications concept, in which they have an established publication record. However, this paper does not state its research aim clearly or contextualize it well. The scientific question is neither well-formulated nor well-referenced. Yes, there are a lot of references, but not to the critical part of their argument.
The authors have much more to do to show that their approach has applicability to the problem they claim to address, and is not merely an exercise for a limited audience of computer scientists. Their purpose as they state it is to fill a gap in the existing ecosystem of biomedical publishing. They claim this can be done by changing the form in which articles are published and adding machine interpretability. They have to document this gap and show that the particular model they chose fills or has demonstrable promise to fill it; and that it is an important gap. They don’t do this:
To be very specific:
- lines 33-40: no references are provided to show the importance of the problem, any prior work on solutions, why prior solutions are inadequate, or how and why this solution would solve the problem. In the Background section there is more referencing, but the logical flow of the argument leaves unreferenced gaps, as follows:
- lines 113-114: Wong 2019 does show greater volumes of publications, but there is then a leap to "methods...to make scientific articles more machine-readable" as a solution without any evidence that this will be useful; it is handwavy. I feel certain that the authors could provide some better referencing, and they should; otherwise the potential applicability of the proposed solution is left out, and it is quite important. They also ought to show what other work may have been done to handle the “too much information” problem, and how that problem is seen in comparison to others by the biomedical community.
- lines 121-122: "While the results can be very valuable there are also clear limitations, with the resulting data needing almost always manual curation to achieve decent quality"
- but this article itself is in essence a curation demonstration - they describe volunteers re-representing claims from several original articles as nanopublications – and the quality control is done here by questionnaire to the curators asking them whether they are satisfied with the results.
- lines 124-127 "A significant number of vocabularies and ontologies in many various domains have been developed, which are now ready to be used for scientific knowledge representation…However, they remain often difficult to find, access and understand due to the lack of documentation, versioning problems, and unresolvable URIs, among other things"
- it is hard to understand this statement which implies ontologies are not being used when in fact they are used intensively *by trained curators* working in biomedical knowledgebases such as UniProt.
- if in reality the authors of this article want the rich corpora of biomedical ontologies to be used by the primary producers of biomedical knowledge, this rather ignores that biomedical curation is a recognized discipline in knowledge production, requiring specialized training and having its own professional society.
- lines 229-231 "[when] the class that should be filled in to arrive at a correct formalization is not directly defined in any existing vocabulary or ontology ...this class might need to be minted as well" but now we have a tag cloud in which the tags are new un-curated “classes” not connected to existing ontologies – this is why biocuration is a specialized discipline and speaks to integrability.
The Figures could be improved both as to selection and content.
- Figure 3 seems marginally relevant if at all.
- Figure 4: the text is far too small to read without difficulty; the formalization of this paper's main claim "Mutations in STX1B are associated with epilepsy" loses fidelity from a straight rendering of the title "Mutations in STX1B, encoding a presynaptic protein, cause fever-associated epilepsy syndromes" or the last line of the abstract "Our results thus implicate STX1B and the presynaptic release machinery in fever-associated epilepsy syndromes" - note that the finding is about STX1B and fever-associated epilepsy syndromes, not simply epilepsy, and that STX1B is said to cause, not simply be associated with such syndromes; the formalization text does not have a link to STX1B in any ontology.
- Figure 5: Shows a "readable" view of a structured abstract of a non-biomedical paper (Wilkinson et al 2016), written by a data scientist. Is this one of the papers actually produced in this study? A structured abstract would be particularly easy to represent as a nanopublication, while the critical content areas of a real paper are not shown. As is well-known, abstracts often problematically do not quite conform to the contents of the actual paper. I was looking for some illustration of how the human-readable formulation of one of the annotated papers would appear.
As noted above, the research question is not well-posed or contextualized. The "experiment" or annotation exercise is not shown to be relevant to transforming the publication methods of the majority of biomedical researchers, but deals only with a small self-selected group who then self-evaluates. As a pilot study this would be far more convincing if actual biomedical researchers (no data scientists or computer scientists allowed) would be asked to evaluate the results.
The raw data and software do not seem to be provided.
As noted earlier, the evaluation was a questionnaire given to the people who produced the annotations, who were a self-selected group and should have been limited to biomedical researchers to keep the techno-people out. Biomedical researchers can be quite technical, but the majority of lab workers I have known are quite allergic to the computer scientists' way of thinking, and so this needs to be taken into account in both the production and the evaluation of the "formalization papers".
A point I would also make to the authors is that their example on lines 71-79 taken from Felix and Barrand 2002 is troubling in terms of loss of fidelity and missing the main point of the article. They say (line 79) that the claim can be rendered informally as
“whenever there is an instance of transient oxidative stress in the context of an instance of a rat brain endothelial cell, then generally (meaning in at least 90% of the cases), that instance of stress has the relation of affecting an instance of Pgp expression.”
But in fact, let’s look at the article’s title:
“P-glycoprotein expression in rat brain endothelial cells: evidence for regulation by transient oxidative stress”
The title says the article is presenting evidence that transient oxidative stress -regulates- Pgp expression, not merely -affects- it. And there is no statement of “whenever …(meaning in at least 90% of the cases)” anywhere in the article. Lastly, the context is actually not a rat brain, but an artificial experiment in a lab, which makes the experimental conditions (the “evidence for regulation by transient oxidative stress” in the title) a critical detail.
The PGP authors state in the last line of the abstract that “oxidative stress, by changing Pgp expression, may affect movement of Pgp substrates in and out of the brain.” This is the key take-home from the article.
And the key question a scientific reader will ask is: what is the evidence that this artificial experiment mirrors what really goes on in a rat brain? Whether that content is present affects the utility of the re-representation.
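Put in the same slot terms, the gap I am describing looks roughly like this; both fillings are my paraphrase, for discussion only:

```python
# The manuscript's rendering (lines 71-79) versus what I read off the
# article's own title and abstract. Both fillings are my paraphrase,
# not the authors' actual nanopublication.
as_formalized = {
    "context_class": "rat brain endothelial cell",
    "subject_class": "transient oxidative stress",
    "qualifier":     "generally (in at least 90% of the cases)",
    "relation":      "affects",
    "object_class":  "Pgp expression",
}
as_the_article_reads = {
    "context_class": "cultured rat brain endothelial cells (in vitro)",
    "subject_class": "transient oxidative stress",
    "qualifier":     "evidence for (no frequency is claimed)",
    "relation":      "regulates",
    "object_class":  "P-glycoprotein (Pgp) expression",
}
# Print only the slots where the two readings diverge.
for slot in as_formalized:
    if as_formalized[slot] != as_the_article_reads[slot]:
        print(f"{slot}: {as_formalized[slot]!r} vs {as_the_article_reads[slot]!r}")
```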
It would be good for the authors to address these points.
I am sorry to be so negative about this article. But I think the authors need to address many of the points shown above to be ready for publication and to be relevant to the problem they claim to be addressing.
• The major weakness is the small sample size which reduces the power of the study and increases the margin of error.
• The caption and the format of some tables could be improved, e.g., in table 5, the caption is not self-descriptive.
• You must cite and discuss related work and compare their results with yours.
• Some examples where the language could be improved include lines 40, 71, 86, 138, 398, …
• A related work section is missing, even though you mention some related work in the background section.
• You have to clearly mention the new contribution(s) compared to your previous work (i.e., Bucur et al., 2019 and Bucur et al., 2020).
• I do not see the importance of having Fig 3. How is the timeline for the special issue with formalization papers at the Data Science journal relevant to the content?
• The structure of table 2 is not clear, e.g., what do you mean by the value 2.27 in `per submission` for the class definitions column?
some broken links:
• Fig 6: https://raw.githubusercontent.com/LaraHack/fpsi analytics/main/np-graph.svg
• footnote on page 11: https://github.com/LaraHack/formalization papers supplemental/tree/main/nanopubs
The research question is not clearly defined, and the knowledge gap being investigated should be identified and positioned in the literature. I have suggested several relevant studies below.
The Analysis of Nanopublications is not sufficient; the authors just report some statistics about the nanopublications they collected. The authors' interpretations of these figures are needed.
This work is a good step toward making scientific articles machine-interpretable by expressing the scientific claims in these articles with formal semantics (i.e., in RDF format). They use the concept of nanopublications for this goal. The whole submission process, including reviews, responses, and decisions, is considered as well.
I have a question for the authors: who is responsible for using your super-pattern template to represent the meaning of the scientific claims in formal logic? I expect the answer to be "article authors". If so, can you guarantee that authors (i.e., those outside the CS community) are able to write scientific claims in formal logic? And what motivates them to do so?
suggested relevant studies:
• Said Fathalla, Sören Auer, & Christoph Lange (2020). Towards the semantic formalization of science. In SAC '20: The 35th ACM/SIGAPP Symposium on Applied Computing, online event, [Brno, Czech Republic], March 30 - April 3, 2020 (pp. 2057–2059). ACM.
• Sahar Vahdati, Said Fathalla, Sören Auer, Christoph Lange, & Maria-Esther Vidal (2019). Semantic Representation of Scientific Publications. In Digital Libraries for Open Knowledge - 23rd International Conference on Theory and Practice of Digital Libraries, TPDL 2019, Oslo, Norway, September 9-12, 2019, Proceedings (pp. 375–379). Springer.
• Said Fathalla, Sahar Vahdati, Sören Auer, & Christoph Lange (2017). Towards a Knowledge Graph Representing Research Findings by Semantifying Survey Articles. In Research and Advanced Technology for Digital Libraries - 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Thessaloniki, Greece, September 18-21, 2017, Proceedings (pp. 315–327). Springer.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.