mockrobiota: a public resource for microbiome bioinformatics benchmarking

Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA
Department of Biology, Tufts University, Medford, MA, USA
Department of Microbiology & Immunology, Microbiome and Disease Tolerance Centre, McGill University, Montreal, Quebec, Canada
Division of Biological Sciences, University of California, San Diego, San Diego, CA, United States
Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, United States
Department of Microbiology and Immunology, G.W. Hooper Foundation, University of California, San Francisco, San Francisco, CA, United States
Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, United States
Department of Pediatrics, University of California, San Diego, La Jolla, CA, United States
Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ, United States
DOI
10.7287/peerj.preprints.2065v1
Subject Areas
Bioinformatics, Computational Biology, Ecology, Microbiology
Keywords
mock community, rRNA, ITS, marker-gene sequencing, metagenomics, microbial ecology, microbiome, bioinformatics, methods development
Copyright
© 2016 Bokulich et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Bokulich NA, Rideout JR, Mercurio WG, Wolfe B, Maurice CF, Dutton RJ, Turnbaugh PJ, Knight R, Caporaso JG. 2016. mockrobiota: a public resource for microbiome bioinformatics benchmarking. PeerJ Preprints 4:e2065v1

Abstract

Mock communities are an important tool for validating, optimizing, and comparing bioinformatics methods for microbial community analysis. We present mockrobiota, a public resource for sharing, validating, and documenting mock community data resources, available at https://github.com/caporaso-lab/mockrobiota. The materials contained in mockrobiota include dataset and sample metadata; expected composition data, annotated based on one or more reference taxonomies; links to raw data (e.g., raw sequence data) for each mock community dataset; and optional reference sequences for mock community members. mockrobiota does not supply physical sample materials directly, but the dataset metadata for each mock community indicate whether physical sample materials are available and, if so, provide the associated contact information. At the time of this writing, mockrobiota contains 11 mock community datasets with known species compositions (including bacterial, archaeal, and eukaryotic mock communities), analyzed by high-throughput marker-gene sequencing. The availability of standard, public mock community data will facilitate ongoing methods optimization and comparisons across studies that share source data, provide greater transparency and access, and eliminate redundancy. This dynamic resource is intended to expand and evolve to meet the changing needs of the ‘omics community.

Author Comment

This is a preprint submission to PeerJ.

Supplemental Information

Fig 1

Fig 1. Example usage of the mockrobiota mock community (MC) resource for marker-gene sequencing pipelines. MC datasets are selected based on multiple input criteria, including dataset metadata, sample metadata, and represented taxa. Raw data (e.g., fastq) are demultiplexed, sequences are dereplicated or clustered as OTUs, and taxonomy is assigned to representative sequences. The observed taxonomic assignments and abundances are then compared to the expected composition (expected taxonomic assignments and abundances) of that MC, e.g., to generate precision and recall scores or correlations between observed and expected values.

DOI: 10.7287/peerj.preprints.2065v1/supp-1
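
The final comparison step in the Fig 1 caption, scoring observed against expected composition, can be illustrated with a minimal sketch. It is not the authors' implementation, and it does not read mockrobiota's actual file formats. It assumes both compositions have already been loaded into plain dictionaries mapping a taxon label to its relative abundance. The function names and the example taxa and abundances are hypothetical, chosen only for illustration. Precision and recall are computed on taxon presence/absence, and the correlation is a Pearson coefficient over the union of expected and observed taxa.

```python
from math import sqrt


def precision_recall(expected, observed):
    """Presence/absence precision and recall of observed taxa vs. expected taxa."""
    expected_taxa = {taxon for taxon, abund in expected.items() if abund > 0}
    observed_taxa = {taxon for taxon, abund in observed.items() if abund > 0}
    true_positives = len(expected_taxa & observed_taxa)
    # Precision: fraction of observed taxa that are truly in the mock community.
    precision = true_positives / len(observed_taxa) if observed_taxa else 0.0
    # Recall: fraction of expected taxa that were actually observed.
    recall = true_positives / len(expected_taxa) if expected_taxa else 0.0
    return precision, recall


def pearson_r(expected, observed):
    """Pearson correlation of abundances over the union of taxa (missing = 0)."""
    taxa = sorted(set(expected) | set(observed))
    x = [expected.get(taxon, 0.0) for taxon in taxa]
    y = [observed.get(taxon, 0.0) for taxon in taxa]
    n = len(taxa)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant vectors
    return cov / sqrt(var_x * var_y)


# Hypothetical example data (not taken from any mockrobiota dataset).
expected = {
    "Bacteroides": 0.25,
    "Escherichia": 0.25,
    "Lactobacillus": 0.25,
    "Staphylococcus": 0.25,
}
observed = {
    "Bacteroides": 0.30,
    "Escherichia": 0.20,
    "Lactobacillus": 0.40,
    "Pseudomonas": 0.10,  # false positive: not in the expected composition
}

precision, recall = precision_recall(expected, observed)
r = pearson_r(expected, observed)
print(f"precision={precision:.2f} recall={recall:.2f} pearson_r={r:.3f}")
```

Taxa absent from one composition are scored as zero abundance in the correlation, so both false positives and false negatives lower it. Other conventions, such as restricting the correlation to expected taxa only, are equally plausible and would yield different values.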