CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes

Australian Centre for Ecogenomics, School of Chemistry & Molecular Biosciences, The University of Queensland, St. Lucia, Queensland, Australia
Institute for Molecular Bioscience, The University of Queensland, St. Lucia, Queensland, Australia
Advanced Water Management Centre, The University of Queensland, St. Lucia, Queensland, Australia
DOI
10.7287/peerj.preprints.554v1
Subject Areas
Bioinformatics, Genomics, Microbiology, Molecular Biology
Keywords
marker genes, genome quality, isolates, metagenomics, single-cell genomics, population genomes
Copyright
© 2014 Parks et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2014. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. PeerJ PrePrints 2:e554v1

Abstract

Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. While this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of ‘marker’ genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree, along with information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate, single-cell, and metagenome-derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination, and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. To facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities. CheckM is open source software available at http://ecogenomics.github.io/CheckM.
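To make the idea of marker-based quality estimates concrete, the following is a minimal sketch of how completeness and contamination might be computed from marker genes grouped into collocated sets. It is an illustration under stated assumptions, not CheckM's implementation: the abstract does not give a formula, so the scoring rule (each collocated set contributes equally, extra copies of a marker count toward contamination), the function name `estimate_quality`, and the marker identifiers `m1` to `m4` are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def estimate_quality(
    marker_sets: List[Set[str]],
    observed_hits: Iterable[str],
) -> Tuple[float, float]:
    """Return (completeness %, contamination %) for one genome.

    Assumption (not from the source): each collocated marker set is
    weighted equally, so a set of many co-located genes does not count
    for more than a set containing a single gene.
    """
    if not marker_sets or any(len(s) == 0 for s in marker_sets):
        raise ValueError("marker_sets must contain non-empty sets")

    # Number of times each marker gene was detected in the genome.
    counts = Counter(observed_hits)

    completeness = 0.0
    contamination = 0.0
    for marker_set in marker_sets:
        # Markers seen at least once indicate the region is present.
        present = sum(1 for gene in marker_set if counts[gene] >= 1)
        # Copies beyond the first suggest sequence from another genome.
        extra = sum(counts[gene] - 1 for gene in marker_set if counts[gene] >= 2)
        completeness += present / len(marker_set)
        contamination += extra / len(marker_set)

    n_sets = len(marker_sets)
    return 100.0 * completeness / n_sets, 100.0 * contamination / n_sets


# Hypothetical example: two collocated sets, one marker missing (m4)
# and one marker detected twice (m3).
example_sets = [{"m1", "m2"}, {"m3", "m4"}]
example_hits = ["m1", "m2", "m3", "m3"]
example_completeness, example_contamination = estimate_quality(
    example_sets, example_hits
)
print(example_completeness, example_contamination)
```

Grouping markers into sets before averaging is one way to reflect the abstract's point about collocation: genes that sit together on the chromosome tend to be gained or lost together, so treating them as a single unit avoids counting one missing region several times.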

Author Comment

This manuscript was submitted to Genome Research on Oct. 21, 2014.

Supplemental Information

Supplemental Figures and Tables

DOI: 10.7287/peerj.preprints.554v1/supp-1

Supplemental Table S13. Lineage-specific completeness and contamination estimates for isolate genomes from large-scale sequencing initiatives

DOI: 10.7287/peerj.preprints.554v1/supp-2

Supplemental Table S16. Lineage-specific completeness and contamination estimates for genomes annotated as finished at IMG, along with predicted translation tables and calculated coding density

DOI: 10.7287/peerj.preprints.554v1/supp-3

Supplemental Table S17. Lineage-specific completeness and contamination estimates for single-cell genomes from the GEBA-MDM initiative along with traditional assembly statistics

DOI: 10.7287/peerj.preprints.554v1/supp-4

Supplemental Table S18. Lineage-specific completeness and contamination estimates for population genomes, plasmids, and phage recovered from metagenomic datasets

DOI: 10.7287/peerj.preprints.554v1/supp-5

Supplemental Table S19. Completeness and contamination estimates for population genomes recovered from an acetate-amended aquifer

DOI: 10.7287/peerj.preprints.554v1/supp-6