All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
I have assessed the revision and confirm that the authors have addressed the reviewers' comments. The revised manuscript is ready for publication.
[# PeerJ Staff Note - this decision was reviewed and approved by Brenda Oppert, a PeerJ Section Editor covering this Section #]
Please evaluate and compare MetaBAT-LR with other binners such as HiCBin (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02626-w), hicSPAdes-binner (https://cab.spbu.ru/software/hicspades/), etc.
Page 11, lines 327-331: Add the appropriate references to support the statement.
Page 11, line 341: "(as seen in the CAMI2 challenge)" - What is the CAMI2 challenge? Please provide an appropriate reference.
no comment
no comment
no comment
I appreciate the efforts of the authors in addressing my concerns.
no comment
Thanks for making changes to the manuscript; the exposition seems better. However, I am a little bit confused about some points.
For example, you're saying:
>1) All the command line parameters are listed in the materials and methods section
However, for BinSPreader you say that it was run "in default mode", yet the default mode of BinSPreader does not utilize any additional paired-end connectivity.
Therefore, please explicitly state, in a separate supplementary text, all command-line options for all tools that are required to reproduce the results. The necessary auxiliary files (e.g. assembly graph, scaffolds, binning tables) should also be made available to end users; please deposit them on Figshare or a similar platform. You can see how similar supplementary files are deposited in the MetaCoAg or BinSPreader papers.
Finally, I clearly understand the authors' direct connection to the MetaBAT tool. However, the following claim on lines 382-384 does not seem fair: "We anticipate that metaBAT-LR has the potential to improve the binning experiments of all metagenome binners since it is not dependent on the output of any specific binning tool (we demonstrated this with MetaBAT 2 as an example)". The documentation of the tool explicitly says that MetaBAT 2-produced bins should be used as input. In order to demonstrate this independence, the authors should include at least one other binner in the study.
Thanks for running AMBER and providing some of its results. It would be helpful if the AMBER result tables were available as-is as supplementary files rather than just selected screenshots – they contain much more useful information than is shown in the paper. In addition, it is worth mentioning that some of the compared tools do not have explicit bin-merging steps (this is at least true for bin3c and BinSPreader), so the direct comparison might be misleading.
Please also include the initial binning statistics in Figure 5.
1. For readers with a non-computational background, the authors should consider including a supplementary method with line-by-line commands and the exact parameters used for the analysis. This would make their work more transparent.
2. Lines 186-187: move to the Methods section.
3. Figure 4: these data should be presented in a table.
4. The authors should consider citing the appropriate references in the discussion section.
The authors present a metagenomic binning technique that incorporates Hi-C information. The use of Hi-C information is quite common in metagenomic assembly but not predominant in metagenomic binning, especially with long reads.
However, long reads are not directly handled by the tool; hence, the tool belongs to the domain of contig/scaffold binning. The tool's name could be confused with tools such as Mega-LR/MetaBCC-LR, which are long-read binners.
The presentation of the paper is concise. However, the literature review is weak: although some works have been cited, they have not been used in the performance evaluation.
In the self-supervised machine learning section, the authors discuss the features used for training. Please present the features in a tabular format for clarity, and please justify the choice of feature scaling used.
For example, (1.0e+06 abs(log2(d1/d2))) does not make sense without a proper explanation. This is an asymmetric feature: the order of d1 and d2 affects the feature value. Also, the added constant seems to mask the log ratio; please explain the mathematics behind this (lines 142-148).
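For concreteness, here is a minimal sketch of the masking concern, under the assumption that the constant is added to the absolute log ratio (the exact combination intended on lines 142-148 is what needs explaining):

```python
import math

def feature(d1, d2, c=1.0e6):
    # Hypothetical reading of the feature: a large constant plus the
    # absolute log2 ratio of two values d1 and d2.
    return c + abs(math.log2(d1 / d2))

# Even a 1000-fold difference between d1 and d2 shifts the feature by only
# ~10 out of 1,000,000 (about 0.001%), so the constant dominates the ratio.
f_equal = feature(100, 100)       # log term is 0
f_skewed = feature(100_000, 100)  # log term is about 9.97
relative_change = (f_skewed - f_equal) / f_equal
print(relative_change)
```

If the constant is instead a multiplicative scale factor, the masking concern would not apply, which is why the explicit formula matters.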
I am not fully convinced by the use of the term "self-supervised", as this looks like a semi-supervised approach: the authors use labels from MetaBAT 2 to train another model. A self-supervised approach should be able to train on its own using feature discrimination. If the method used Hi-C connectivity between scaffolds together with assembly-graph connections to train the model, that would push it towards a self-supervised method, since no labelled information would be used. I suggest the authors read more on this - https://en.wikipedia.org/wiki/Self-supervised_learning. A solid bioinformatics-related example is training ML models where data points are labelled using single-copy marker genes (which should be in different bins) and assembly-graph connections (connected contigs/scaffolds are more likely to be in the same bin). A good read on this would be https://doi.org/10.1609/aaai.v36i4.20388 (published in AAAI 2022).
The authors then combine the predicted "connectedness-like" metric from the training with TNF vectors to decide which bins need to be merged. Does this mean the whole pipeline merges bins to improve completeness? Please have a look at the DAS Tool program (https://www.nature.com/articles/s41564-018-0171-1); it could be a better baseline for benchmarking due to its popularity.
Please explain how the random forest predictions are combined with TNF. Also, please mention the probability threshold used to create the affinity graph.
Lines 159-160 are not clear. The LPA algorithm results in an NxN probability matrix, which gives the probability of a node (n) having a label (l). How did you use this to partition the graph?
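For concreteness, one common way to go from such a node-by-label probability matrix to a partition is a per-node argmax; this is an assumption for illustration, not necessarily what the authors did, which is exactly what lines 159-160 should spell out:

```python
import numpy as np

# P[n, l]: probability that node n carries label l (a hypothetical
# 4-node, 3-label example; in LPA initialized with one label per node,
# the number of labels can equal the number of nodes, giving N x N).
P = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.3, 0.7],
])

# Assign each node to its most probable label, then group nodes by label.
assignment = P.argmax(axis=1)
partition = {int(l): np.where(assignment == l)[0].tolist()
             for l in np.unique(assignment)}
print(partition)  # {0: [0, 1], 1: [2], 2: [3]}
```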
Lines 186-192: higher test accuracy could potentially indicate overfitting. It would be more useful for the reader to see an evaluation of this modelling. Since the entire pipeline relies on these results, it must be evaluated to ensure that the binning result is not dominated by TNF values alone (this again highlights that the authors must explain how they combined the random forest predictions and TNF vectors).
Benchmarks are performed against very old and not commonly used tools. Please evaluate against algorithmic tools like MetaWatt (https://www.frontiersin.org/articles/10.3389/fmicb.2012.00410/full) and MaxBin 2 (https://academic.oup.com/bioinformatics/article/32/4/605/1744462), machine-learning tools like VAMB/SemiBin, and graph-based tools like MetaCOAG (https://link.springer.com/chapter/10.1007/978-3-031-04749-7_5).
Cat Fecal Microbiome Dataset - this dataset uses short-read sequencing, yet the authors present MetaBAT-LR. I was under the impression that the tool was developed to bin metagenomic long reads, but it seems it is a contig-binning tool and does not handle long reads at all.
The name of the tool is misleading: it does not use long reads as input and hence cannot be pitched as a long-read tool.
Most popular contig binning tools are not included in the evaluation.
1. The section “Self-supervised machine learning” is key but difficult to follow. Rather than describe the many numerical representations only inline within the text, it would help readers to tabulate these entities. Additionally, I would have greatly appreciated a toy matrix representation.
2. As above, the section “Bin merging and scaffold recruitment” also is key and would benefit from the addition of succinct mathematical notation and slightly more explicit language. At present, it is necessary to refer to the codebase to understand with certainty what is being calculated. An example of explicitness would be (at line 154) “The percentage of <scaffold> pairs … “
3. Figure 1: Although a workflow diagram is a frequent feature of bioinformatics articles, I feel that figure 1 is currently somewhat trivial. As a visual aid to express the major contribution of this work, a second panel could be added that distils in more detail the core operations of the algorithm currently represented between “Random Forest Model” and “Improved Metagenome Bins”. This would also help to alleviate the descriptive shortcomings of the present Materials and Methods.
4. For each of the graphs mentioned in the manuscript, both nodes and edges should be clearly defined by the nature/composition of the entities used.
5. Line 158: Please clarify the set of scaffolds used to construct the graph, as it is left to the reader to assume “all scaffolds”.
6. Line 203: Please clarify whether these are percent increases or absolute values of completeness. That is, is it a 52% increase for Lactobacillus, or has it now reached 52% completeness? If the latter, then readers will need the starting completeness value as well.
no comment.
Line 302: Regarding the failure to improve completeness for some bins within the real datasets, one contributing reason not mentioned might be the impact of excluding a significant portion of the metagenomic assembly when imposing a minimum scaffold length of 1500 bp. This would particularly exacerbate the binning of more complex communities, where total diversity and microheterogeneity combine to have a significant deleterious effect on N50.
No comment
Despite the name, metaBAT-LR is not a standalone binner. Instead, it is a binning refiner that uses Hi-C linkage information along with other contig features such as coverage, but otherwise relies on the quality of the initial binning (which is done by metaBAT). Therefore, metaBAT-LR should be compared not only with standalone binners but also with binning refiners, especially those that may use read-connectivity information, such as METAMVGL (Zhang et al, 2021) and BinSPreader (Tolstoganov et al, 2022).
Since it is a binning refiner, the influence of the initial binning quality on the final binning should be assessed as well. I would suggest using at least VAMB and MetaWRAP. For example, the BinSPreader paper states that refining the VAMB binning of Zymo using paired-end reads and/or Hi-C data results in an almost 100% F1 score (calculated from completeness and purity), while refining the metaBAT bins was not as successful due to some contamination of the bins. I think this could be done relatively quickly, as the initial binnings (as well as gold-standard bin assignments) could be taken from, e.g., the BinSPreader supplementary.
Also, why were the bin3c and ProxiMeta results on Zymo not shown? I think all tools should be benchmarked uniformly.
The de facto standard tool for MAG quality evaluation when the ground truth is known is AMBER (Meyer et al, 2018).
The paper would strongly benefit if the AMBER results were shown, as the results currently presented are a bit patchy. For example, Figure S2 shows aligned lengths, but no genome sizes are provided. Figure 2 shows completeness and contamination separately, but it would be great to see them combined into an F1 score. All of these (as well as per-bin statistics, graphs, and MIMAG completeness criteria) could be obtained automatically from the AMBER results.
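For reference, a minimal sketch of combining the two metrics into a per-bin F1, assuming purity is taken as 1 − contamination (the convention used by AMBER may differ in detail):

```python
def bin_f1(completeness, contamination):
    # Treat completeness as recall and purity (1 - contamination) as
    # precision; F1 is their harmonic mean.
    purity = 1.0 - contamination
    if completeness + purity == 0.0:
        return 0.0
    return 2.0 * completeness * purity / (completeness + purity)

# e.g. a bin at 90% completeness with 5% contamination:
print(round(bin_f1(0.90, 0.05), 4))  # 0.9243
```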
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.