Review History

All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.

View examples of open peer review.


  • The initial submission of this article was received on February 3rd, 2021 and was peer-reviewed by 3 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on March 15th, 2021.
  • The first revision was submitted on December 14th, 2021 and was reviewed by 1 reviewer and the Academic Editor.
  • A further revision was submitted on February 9th, 2022 and was reviewed by the Academic Editor.
  • A further revision was submitted on March 8th, 2022 and was reviewed by the Academic Editor.
  • The article was Accepted by the Academic Editor on March 10th, 2022.

Version 0.4 (accepted)

· Mar 10, 2022 · Academic Editor


The language revisions made were satisfactory.

[# PeerJ Staff Note - this decision was reviewed and approved by Brenda Oppert, a PeerJ Section Editor covering this Section #]

Version 0.3

· Feb 17, 2022 · Academic Editor

Minor Revisions

Please note that language problems still remain in your manuscript. For example, "MetaPhlAn2 tool (Ditzler et al., 2015b) is widely in literature" (lines 189-190) [should be widely USED]; "...Decision Tree, Random Forest, LogitBoost, AdaBoost, an ensemble of SVM with kNN, and an ensemble of the Logitboost with kNN is considered" (lines 404-405) [should be ARE considered]. These are just two examples; thorough checking of the entire text should be carried out, preferably by a fluent English speaker.

[# PeerJ Staff Note: The Academic Editor has identified that the English language must be improved. PeerJ can provide language editing services - please contact us at for pricing (be sure to provide your manuscript number and title) #]

Version 0.2

· Jan 5, 2022 · Academic Editor

Minor Revisions

The revised version of the manuscript has been seen by one of the previous reviewers. They acknowledge the improvements but still have several minor concerns that need to be addressed.


Basic reporting

The use of "our", "we" throughout the manuscript makes the writing quite informal and I would recommend that this be changed throughout. The manuscript is the presentation of a scientific investigation. It is not owned by, nor belongs to, any set of authors, as the same study can be proposed and implemented by others. Therefore, it is more appropriate to refer to "The study" or "the results" and what explicitly occurred during the investigation, rather than "Our study", "Our dataset", "we applied", etc.

The section "Validation on external data" includes the use of , instead of . for decimal points and should be changed.

References are generally appropriate and in place. Page 9, websites should be referenced properly alongside the other references (including the date the website has been accessed on since webpages can change).

Figure 1: No of sick/healthy samples should be No. of ... or # sick/healthy samples.

Table 1: There are a lot of values. Perhaps it would be helpful to highlight in bold the highest accuracy/recall/specificity etc. in each sub-table and the lowest std. dev.

In the Dataset and Preprocessing Steps section (and elsewhere as appropriate) all weblinks should be properly referenced in the list of citations including date on which the content was accessed since the contents of the weblinks can be changed.

Changes in Figure 4 mean that the units for Accuracy are no longer %.

Please clarify to which dataset the results in Figure 5 belong.

Figure 7: "Relative abundance"... relative to what? The y-axis now has values but remains unlabelled and without units.

Figure 8: The y-axis is still not labelled and no units are provided.

Figure 10 is still difficult to read, although it is more legible than in the original submission. I think the authors should reflect on the "so what?" question here: What do you want the reader to take away from this figure? There is a lot of information presented in Figure 10. If you feel the reader needs all of this information in order to understand the conclusions drawn then please make the figure larger and more legible. Perhaps highlight specific regions of the figure and label them with (a), (b), (c) etc. to assist with your discussions. Alternatively, the authors may wish to revise the figure to present the key information (i.e. less information) more clearly to the reader so that they can interpret the results and follow the conclusions discussed in the paper. You can generate a hierarchical clustering map... this doesn't mean you should.

Experimental design

Line 442: "Very similar preprocessing"...How similar? What was the same? What was different and why are the differences insignificant?

In the validation on external data, RF is isolated; however, the initial findings from Table 1 would suggest LogitBoost would be equally as strong or appropriate to use. The authors may wish to reiterate why they focus on RF explicitly.

The inclusion of the independent dataset has improved the rigour of the study.

The research is original and of interest to the audience of the journal.

Validity of the findings

Line 424 "As shown in Figure 5..." These findings can also be found in Table 1.

Table 1 has a lot of values. Perhaps it would be helpful to highlight in bold the highest accuracy/recall/specificity etc. in each sub-table and the lowest std. dev.

In Table 1 it is worth commenting that CMIM, FCBF and MRMR show signs of poor fitting across all models with low accuracy and high recall. (re lines 411-413).

Results Section (from line 419-...): RF is singled out here, but it could be argued LogitBoost performed best (or as well as RF) when all 1331 features are used.

Line 431-432: "diagnosis could be performed with 88% accuracy" - so what? Is it good enough for clinical adoption? At present this sentence just repeats what is in the table but doesn't really make a statement. The subsequent "only analysing the amounts of 14 specific species" - it would be more appropriate to say "14 features are sufficient" to accurately classify IBD compared with using all 1331 features. (Note the typo: "by checking only analyzing".)

Line 433-434: I think the final sentence of this section needs to be rephrased, perhaps purpose rather than proposed. Also the subsequent Validation on external data seems to conflict with this point since it appears that only 10 species are required. If line 434 is a hypothesis then test this on the independent external dataset (which turns out to not be fully possible - hence explain why and possibly revise under which circumstances people would use the 14 species). Furthermore given that 10 species are highlighted in the validation dataset it might be appropriate to leave any conclusion/propositions until after all the results are presented.

Validation on External data section: There are references that the results are shown in Table 1 but I was unable to see these findings - perhaps a table is missing?

Line 455-459: This appears to be contradictory. 10 selected species yielded higher performance metrics in the validation than exploration. Therefore the 4 additional features are beneficial? Please clarify - are you saying, that the 10 species from the validation cohort considered in isolation in the exploration cohort are less useful than 14 features in the exploration cohort? Furthermore, using the 10 selected species in both cohorts it was found that the exploration cohort was easier to classify than the validation cohort - in which case why might this be the case?

Additional comments

As a side note, it would be appreciated if the authors could cross-reference where the changes can be observed in the revised manuscript in their responses.

Version 0.1 (original submission)

· Mar 15, 2021 · Academic Editor

Major Revisions

The reviews provided are detailed and constructive. All three reviewers have expressed major concerns, especially with possible model overfitting. Reviewer 1 and Reviewer 2's suggestion that the findings should be validated on a larger open-sourced dataset needs to be implemented.

[# PeerJ Staff Note: It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful #]

[# PeerJ Staff Note: Please ensure that all review comments are addressed in a rebuttal letter and any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.  It is a common mistake to address reviewer questions in the rebuttal letter but not in the revised manuscript. If a reviewer raised a question then your readers will probably have the same question so you should ensure that the manuscript can stand alone without the rebuttal letter.  Directions on how to prepare a rebuttal letter can be found at: #]


Basic reporting

• The writing is clear with a few small mistakes in places
• The literature review is missing some important papers in the area (details provided in following sections)
• Strong claims about elements of the microbiome causing disease are made (lines 17-18, 39-42). For the majority of cases only associations have been found; direct causation is rarely confirmed. Strong claims about causation should have strong evidence provided.
• The aetiology of IBD is complex and many other factors are associated with IBD, not just the gut microbiome (lines 19-20). This should be made clearer.
• Figure 5 should have y axis labels and confidence intervals to allow proper comparison of models
• Figure 6 C is impossible to read and understand, it’s too small
• Figure 7 is missing x and y axis labels
• Figure 8 is missing y axis labels, so I don’t know what it’s measuring (relative abundance?). There are too many groups and it’s difficult to tell species apart. I recommend visualising only the 10 or 12 most abundant species and combining the others into an “Other” category
• Figure 9 A) B) is very difficult to understand from a PDF (it’s 3D). I recommend including an interactive 3D plot in the supplemental instead.

Experimental design

• The methods section doesn’t describe the processing of the raw MetaHit sequence data used to create the metagenomics dataset. Were the sequence data quality-checked with FastQC and MultiQC? Were non-biological sequences (e.g. adapters) removed with Trimmomatic? MetaPhlAn2 uses a read-mapping approach to match against reference genomes, so removing non-biological sequences is important. It can’t be assumed that these steps were done by the MetaHit team that uploaded their data.
• The methods section doesn’t describe preprocessing done to the metagenomic data. Different samples have a different number of bacterial reads, how was this normalised? See and
• What criteria are used to define whether a patient has inflammatory bowel disease? (E.g. the Montreal classification.) Were the patients treated with antibiotics? This will change the structure of the microbiome significantly. Was IBD in an active state or in remission when the samples were taken? I think a more thorough description of patient demographics is required – perhaps a table?
• Unfortunately, from Figure 2 it appears the feature selection process is fatally flawed, causing all estimates of predictive power to be overstated. Feature selection is done on the entire metagenomics dataset, and models are then trained on the reduced dataset using cross-validation. Feature selection should instead be included inside the resampling method to estimate the stability of the feature subset and avoid overfitting. Ambroise and McLachlan (2002; and Svetnik et al. (2004; showed that improper use of resampling to measure performance will result in models that perform poorly on new samples.
• Spearman rank correlation should not be used on metagenomic data because it is compositional (the sum of all bacterial proportions in each sample is constrained to 1), see 10.1371/journal.pcbi.1002687,
• Comparing model performance should involve analysing confidence intervals generated from cross-validation resamples and using a statistical approach
• I’m not sure the unsupervised approach contributes significantly to the paper. From Figure 10 there are two small clusters, but there are many confounders that could contribute to this (e.g. antibiotic use or other treatments). To understand the relationship between samples and species, other methods could be used, such as interpretable machine learning (e.g. SHAP, arXiv:1705.07874)
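The compositionality point raised above can be illustrated concretely: a centred log-ratio (CLR) transform moves proportion data into a space where standard correlation and distance measures behave more sensibly. A minimal sketch, assuming NumPy; the `clr` function, the pseudocount value, and the toy `abundances` array are illustrative, not taken from the manuscript:

```python
# Centred log-ratio (CLR) transform for compositional abundance data.
# A small pseudocount replaces zeros before taking logs; all names here
# are illustrative placeholders, not the paper's actual variables.
import numpy as np

def clr(counts, pseudocount=1e-6):
    """Rows = samples, columns = taxa; each row is treated as a composition."""
    comp = counts + pseudocount
    comp = comp / comp.sum(axis=1, keepdims=True)   # close each row to proportions
    log_comp = np.log(comp)
    # Subtract each row's mean log value: the geometric-mean reference
    return log_comp - log_comp.mean(axis=1, keepdims=True)

abundances = np.array([[10.0, 30.0, 60.0],
                       [ 5.0,  5.0, 90.0]])
z = clr(abundances)
print(z.sum(axis=1))   # each CLR-transformed row sums to ~0
```

After the transform, correlations are computed on the CLR values rather than on the raw proportions, which avoids the spurious negative correlations induced by the unit-sum constraint.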

Validity of the findings

• The flawed feature selection procedure used means all prediction estimates are invalid. They should be recalculated with the correct feature selection procedure, incorporating feature selection inside of the cross-validation resampling process
• Given the performance claims made about the models (AUROC approaching 1 in Figure 4) I think it is very important that the trained models are externally validated. There are many large metagenomic datasets available such as Gevers et al. 2014 which contains more than 1000 samples (10.1016/j.chom.2014.02.005). Previous machine learning work predicting IBD has used external validation to demonstrate the ability of trained models to generalise (10.1109/TCBB.2018.2831212). Qiita ( is an excellent platform to browse and discover new microbiome datasets
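The resampling fix requested above amounts to wrapping the selector and the classifier in a single pipeline that is refit within each cross-validation fold, so features are re-selected on training data only. A minimal scikit-learn sketch: the synthetic `X`/`y` and the `SelectKBest` selector are illustrative stand-ins (it would stand in for whichever method, e.g. CMIM or MRMR, is actually used), not the authors' setup:

```python
# Feature selection nested inside cross-validation: the selector is fit
# on each training fold only, never on the full dataset.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.random((60, 200))          # stand-in for a species-abundance matrix
y = rng.integers(0, 2, size=60)    # stand-in for IBD / healthy labels

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=14)),    # selection happens per fold
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

scores = cross_val_score(pipe, X, y,
                         cv=StratifiedKFold(5, shuffle=True, random_state=0))
print(scores.mean())
```

Because the pipeline is cloned and refit per fold, the held-out fold never influences which features are chosen, which is exactly the leakage the reviewers describe.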

When using external validation data it's important that the data can be harmonised across data sources (e.g. species or genus names match, and the same reference database is used)
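One way to harmonise features across data sources, sketched with pandas; the frames and species names below are purely illustrative (real inputs would be MetaPhlAn2-style abundance tables):

```python
# Align taxa between two cohorts before external validation: keep only
# the intersection of species names and order the columns identically.
# (Toy frames with hypothetical species names, for illustration only.)
import pandas as pd

exploration = pd.DataFrame({"s__Faecalibacterium_prausnitzii": [0.20, 0.10],
                            "s__Escherichia_coli": [0.05, 0.30]})
validation = pd.DataFrame({"s__Escherichia_coli": [0.10],
                           "s__Bacteroides_fragilis": [0.40]})

shared = exploration.columns.intersection(validation.columns)
X_train = exploration[shared]
X_test = validation[shared]      # same features, same column order
print(list(shared))
```

Taxa absent from one cohort must be dropped (as here) or imputed explicitly; silently mismatched columns would invalidate any external-validation result.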


Basic reporting

This article is generally well written, uses a professional academic structure, and is relatively self-contained. The argument presented is clear and uses professional English.

Further supporting information is required for the claims made in lines 38-40.

There are some minor spelling and grammatical errors including:
Line 60: referencing style should be kept consistent throughout.
Line 297: Euclidean (E not e).
Line 308: Remove ?
Lines 325-326: Grammatical error "applying a careful feature selection". Please rephrase.

There are some difficulties with the following Figures:
Figure 4: Please use . instead of , for the decimal point. Please also add (%) as units to the header of the Accuracy columns. Increasing font size would be desirable here even if it means increasing the table sizes and significantly reducing the white space between them.
Figure 5: Colours are difficult to differentiate for those that are color blind. Please consider differentiating the bars with different fill textures/patterns. Please also add a label for the y-axis on each subfigure and please include confidence intervals.
Figure 6: The text in all three sub-figures is illegible. Note the x-axis label for (a) has been cut off by sub-figure (c). The numeric values in (c) are completely illegible and should be removed to aid clarity.
Figure 7: Please label the y-axis. Please increase the font size on the x-axis. Please adjust the y-axis so that the numeric values are consistent on both.
Figure 8: Please label y-axis and increase the font size of the text on the right-hand side.
Figure 9: Please adjust the yellow coloring as it is difficult to see with the white background.
Figure 10: Perhaps this is best presented as a whole page image in landscape format?

Experimental design

The aims of the paper clearly fall in the remit of the journal.

The research question is well defined, relevant and meaningful. The authors have described how they feel the results address the deficiencies of the knowledge base.

There are however some concerns over the scientific methodology.
- The feature selection approach in its current form is at risk of leading to overfitting of the classification models. The feature selection process should be carried out as part of the resampling method. The methodology should be revised and the experimentation repeated once this change has been implemented.
- More information is required on the pre-processing of the raw MetaHit sequence data. Are there any further preprocessing steps conducted to refine the metagenomic data?
- How were patients classed as having (or not having) IBD? How homogeneous was the IBD patient cohort (and indeed the control cohort)?

- Line 218: How were the parameters optimized? In general, please provide parameters and where appropriate the names of functions used from toolboxes.
- Line 220: Please confirm if the train/test split was consistent across all models.
- Line 222: The intersection of features would also be interesting to examine as these are the features that consistently came up - are they the most useful?

Some minor points:
Lines 126-127: please include a citation to justify the claim "commonly used as a feature".
Lines 128-139: Please link back to the three types of features discussed in lines 122-124. Lines 122-124 are introduced but appear to be unused in the later discussion.

Please clearly explain what the 3302 columns in Figure 1 represent. I would also recommend in Figure 1 moving the "no of sick/healthy samples" from the right-hand side to the left-hand side so that we can see the 328 samples are split into 148 and 234. Currently, it looks like the text is associated with the 1st and 2nd row of data explicitly.

Validity of the findings

The primary concern here follows from the risk of overfitting caused by the methodological approach (as described in my comment above). The results should be revisited once the methodology is revised. There are some general comments which would be beneficial for the authors to consider in their revision.

Whilst the sample size is reasonable for this study, any findings from this work (with the revised methodology) would be significantly strengthened by validating the findings on a larger open-sourced dataset. Whilst this is substantial work it will result in more robust findings, support model comparisons in the future, and ease reproducibility, which will hopefully see the authors' efforts rewarded through increased community adoption of their approach.

Line 250-251: Please comment on the biological relevance of the features contained in the intersection of all feature selection methods. Is XGBoost good because it identifies particularly interesting species or are the really interesting species contained within the 10 species of the intersection?
Line 270: Please elaborate on how the values 12 and 7 were identified. They are not immediately obvious from Figure 7.
Lines 263-281: Please comment on the role the isolated species are known to have in biology similar to the discussion presented in lines 258-262.
Line 298: Why cut off at the top 24 species? This seems arbitrary without further explanation.

Discussion Section:
Lines 325-326: The description of "careful feature selection" is perhaps misleading. The authors apply a number of feature selection methods, but what is the take-home message for the reader? Which feature selection method should be used? Is it important to select a particular feature selection method depending on the classification model to be used? Should a number of feature selection methods be used and the final features are determined by the union/intersection?
Lines 330 and 334: "unprecedented performance/results" - This is a bold statement and should be revised. The conclusion "feature selection can improve classification results" is a well-established finding so there is precedence that your results should be higher than previous studies. The performance increase is certainly good but I don't think unprecedented is the appropriate description.
Line 339: "leads to a framework" - perhaps a rephrase along the lines of "provides the foundation for a framework to be developed in future work" would be more appropriate.
Line 341: "would imply a potential for narrowing" - I don't think it is clear what is meant here. I think a rephrase would be helpful for the reader.
Line 352: "provides a blueprint". I think one of the following may be more appropriate: "provides evidence" or "provides justification" or "provides motivation" or "provides support".

Additional comments

The paper is well written and I would compliment the authors for some good practices, such as Figures 2 and 3 which clearly and concisely illustrate the concept outlined in the main discussion. The paper presents promising results however, further experimentation should be conducted to ensure the classification models are not inadvertently overfitted as a result of the feature selection approach outlined in the current manuscript.


Basic reporting

The reporting of this article is, to describe it briefly, unbalanced. It is written in clear, professional English, and there are sections of well-developed and well-explained text, both in Materials and Methods and in the Experiments (according to PeerJ standards it should be named Results), but there are other sections that are clearly missing content, or whose content contradicts itself. As an example, we do not need to look further than the Abstract, where we find a claim that dysbiosis is the cause of IBD and other gastrointestinal illnesses, only to read a few lines later that we do not have a clear picture of how this dysbiosis affects IBD, just that they show a strong association (this is also repeated in the Introduction). Claims of causality should be avoided unless evidence of this relationship is abundant, and with the gut metagenome we are just not there yet.
Getting into the Introduction, I find that some general description of what we know about the relationships between IBD and dysbiotic gut environments would be beneficial to highlight the dire need for this kind of diagnostic approach. I would add a brief description of the most common symptoms of IBD and the relationship between this illness and the gut to lines 51 to 57. Mentioning the high comorbidity of IBD with other dysbiosis-related illnesses, such as depression, anxiety or obesity, could also highlight the relevance of this research.
Regarding the figures, I do not have any major complaints. The tables in Figure 4 may be too small in some screens, so maybe they should be independent supplementary tables. Other than that, they are clear, well described and easy to interpret.
Raw data was supplied through the project’s website. This should still be disclosed in the paper.

Experimental design

The experimental design follows the trend of using different Machine Learning algorithms for classification of different atypical markers of illnesses. This is a powerful approach both to understand better which alterations in the microbiotic environment are most influential on IBD, and for clinical purposes as diagnostic and treatment.
The research question is well defined and relevant, and the investigation is, as far as I can see, performed according to technical and ethical standards.
In the description of the methods we find, again, the imbalance that we previously saw in the Introduction and Abstract. The first part of that section (lines 122-144) and the Feature Selection descriptions are good. Model Selection and Unsupervised Learning, however, are too brief and shallow. There is no description of any of the classifiers used, nor a justification of their use. In fact, the Model Construction section is composed mostly of a description of the performance metrics used to analyze how well those algorithms performed, instead of why those algorithms were used and how they work. Something similar happens with Unsupervised Learning, where the description is so brief that a description of the algorithms has been included in the Experiments (Results).

Validity of the findings

To my knowledge the results presented in this paper are novel and relevant. I would like the authors to take a bit of time to try to link them to other research in IBD and metagenomics, mostly on the topics I already mentioned in my comments on Basic Reporting. IBD is a complex, multifactorial and comorbid illness, and that should be addressed both in the introduction and description of the issue, and in the discussion and interpretation of the results, even if the scope of the analysis does not include all those factors or comorbid conditions.
I must admit that I do not like seeing some of the results, such as the most relevant species for IBD diagnosis according to your analysis, first mentioned in the Discussion. Maybe restructuring the article according to the PeerJ Standard Sections template would help avoid this. I understand that the focus of this paper is on the use of Machine Learning as a diagnostic tool, but if you are going to mention it at all, it should still be mentioned in the Results section.

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.