mockrobiota: a public resource for microbiome bioinformatics benchmarking
- Subject Areas
- Bioinformatics, Computational Biology, Ecology, Microbiology
- Keywords
- mock community, rRNA, ITS, marker-gene sequencing, metagenomics, microbial ecology, microbiome, bioinformatics, methods development
- Copyright
- © 2016 Bokulich et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2016. mockrobiota: a public resource for microbiome bioinformatics benchmarking. PeerJ Preprints 4:e2065v1 https://doi.org/10.7287/peerj.preprints.2065v1
Abstract
Mock communities are an important tool for validating, optimizing, and comparing bioinformatics methods for microbial community analysis. We present mockrobiota, a public resource for sharing, validating, and documenting mock community data resources, available at https://github.com/caporaso-lab/mockrobiota. The materials contained in mockrobiota include dataset and sample metadata; expected composition data, annotated based on one or more reference taxonomies; links to raw data (e.g., raw sequence data) for each mock community dataset; and optional reference sequences for mock community members. mockrobiota does not supply physical sample materials directly, but the dataset metadata included for each mock community indicate whether physical sample materials are available and provide the associated contact information. At the time of this writing, mockrobiota contains 11 mock community datasets with known species compositions (including bacterial, archaeal, and eukaryotic mock communities), analyzed by high-throughput marker-gene sequencing. The availability of standard, public mock community data will facilitate ongoing methods optimization and comparisons across studies that share source data, increase transparency and access, and eliminate redundancy. This dynamic resource is intended to expand and evolve to meet the changing needs of the ’omics community.
Author Comment
This is a preprint submission to PeerJ.
Supplemental Information
Fig 1. Example usage of the mockrobiota mock community (MC) resource in marker-gene sequencing pipelines. MC datasets are selected based on multiple input criteria, including dataset metadata, sample metadata, and represented taxa. Raw data (e.g., fastq) are demultiplexed, sequences are dereplicated or clustered as OTUs, and taxonomy is assigned to representative sequences. The observed taxonomic assignments and abundances are then compared to the expected composition (expected taxonomic assignments and abundances) of that MC, e.g., to generate precision and recall scores or correlations between observed and expected values.
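The final comparison step described in Fig 1 can be sketched in a few lines of Python. This is a minimal illustration, not the mockrobiota authors' implementation: it assumes the expected and observed compositions have already been loaded as dictionaries mapping taxonomy strings to relative abundances, and the function names (`precision_recall`, `pearson_r`) and the toy taxa are hypothetical. Precision and recall are computed here on taxon presence/absence, and the correlation is a plain Pearson coefficient over the union of taxa.

```python
from math import sqrt


def precision_recall(expected, observed):
    """Presence/absence precision and recall of observed taxa vs. expected taxa."""
    exp_taxa = {t for t, a in expected.items() if a > 0}
    obs_taxa = {t for t, a in observed.items() if a > 0}
    true_pos = len(exp_taxa & obs_taxa)
    precision = true_pos / len(obs_taxa) if obs_taxa else 0.0
    recall = true_pos / len(exp_taxa) if exp_taxa else 0.0
    return precision, recall


def pearson_r(expected, observed):
    """Pearson correlation of abundances over the union of taxa (missing = 0)."""
    taxa = sorted(set(expected) | set(observed))
    x = [expected.get(t, 0.0) for t in taxa]
    y = [observed.get(t, 0.0) for t in taxa]
    n = len(taxa)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant vectors
    return cov / sqrt(var_x * var_y)


# Toy data (hypothetical): relative abundances keyed by taxonomy string.
expected = {"k__Bacteria;g__TaxonA": 0.50,
            "k__Bacteria;g__TaxonB": 0.25,
            "k__Bacteria;g__TaxonC": 0.25}
observed = {"k__Bacteria;g__TaxonA": 0.60,
            "k__Bacteria;g__TaxonB": 0.30,
            "k__Bacteria;g__TaxonD": 0.10}  # false positive; TaxonC is missed

precision, recall = precision_recall(expected, observed)
r = pearson_r(expected, observed)
print(f"precision={precision:.2f} recall={recall:.2f} r={r:.2f}")
```

In this toy case two of the three observed taxa are expected, and two of the three expected taxa are recovered, so precision and recall are both 2/3. In practice the comparison would usually be repeated at each taxonomic level, since a method may assign the correct genus but the wrong species.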