This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Kang D, Li F, Kirton ES, Thomas A, Egan RS, An H, Wang Z. 2019. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ Preprints 7:e27522v1 https://doi.org/10.7287/peerj.preprints.27522v1
We previously reported MetaBAT, an automated metagenome binning software tool that reconstructs single genomes from microbial communities for subsequent analyses of uncultivated microbial species. MetaBAT has become one of the most popular binning tools, largely due to its computational efficiency and ease of use, especially in binning experiments with a large number of samples and a large assembly. MetaBAT requires users to choose parameters to fine-tune its sensitivity and specificity. If those parameters are not chosen properly, binning accuracy can suffer, especially on assemblies of poor quality. Here we developed MetaBAT 2 to overcome this problem. MetaBAT 2 uses a new adaptive binning algorithm to eliminate manual parameter tuning. We also performed extensive software engineering optimization to increase both computational and memory efficiency. A comparison of MetaBAT 2 with alternative software tools on over 100 real-world metagenome assemblies shows superior accuracy and computing speed. Binning a typical metagenome assembly takes only a few minutes on a single commodity workstation. We therefore recommend that the community adopt MetaBAT 2 for metagenome binning experiments. MetaBAT 2 is open-source software and is available at https://bitbucket.org/berkeleylab/metabat.
This is a submission to PeerJ for review.
A list of metagenome assemblies used to evaluate binning performance
IMG access IDs and Sample names for IMG-100 dataset
Very impressive work! Really neat implementation of a set of common sense innovations, can't wait to try it!
As discussed, I think adding an additional matrix in the form of taxonomic classifications would likely improve results further, especially as more and more reference genomes (for example, those compiled in GTDB) are becoming available. Taxonomic classifications could easily be added to the graph-based framework, since encoding them as vectors is straightforward.
Thank you, Marc, this is a great suggestion! We are implementing something similar to taxonomic information (but hopefully more robust, as metagenome contigs may lack phylogenetic markers). We'll at least incorporate GTDB info into the machine learning framework we are implementing and assess its contribution to binning accuracy.