Supplemental Information

peerj-preprints

PeerJ

PeerJ Preprints

2167-9843

PeerJ Inc.

San Francisco, USA

2378v1

10.7287/peerj.preprints.2378v1

Bioinformatics, Computational Biology, Genomics, Microbiology, Taxonomy

SLIMM: Species level identification of microorganisms from metagenomes

Temesgen Hailemariam Dadi (temesgen.dadi@fu-berlin.de), Bernhard Renard, Lothar H. Wieler, Torsten Semmler, Knut Reinert

1 Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany

2 Max Planck Institute for Molecular Genetics, Berlin, Germany

3 Robert Koch Institute, Berlin, Germany

4 Department of Veterinary Medicine, Freie Universität Berlin, Berlin, Germany

5 Department of Mathematics and Computer Science, Max Planck Institute for Molecular Genetics, Berlin, Germany

19 August 2016

e2378v1

2016

Dadi et al.

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Identification and quantification of microorganisms is an important step in studying the alpha and beta diversities within and between microbial communities, respectively. Both identification and quantification of a given microbial community can be carried out using whole genome shotgun sequences with less bias than using 16S-rRNA sequences. However, shared regions of DNA among reference genomes and taxonomic units pose a significant challenge in assigning reads correctly to their true origins. Existing microbial community profiling tools commonly deal with this problem either by preparing signature-based unique references or by assigning an ambiguous read to its lowest common ancestor in a taxonomic tree. The former method can only make use of reads that map to the curated regions, while the latter suffers from a lack of uniquely mapping reads at lower (more specific) taxonomic ranks. Moreover, even though these tools generally perform well in calling the organisms present in a sample, there is room for improvement in calling the correct relative abundances of those organisms. We present a new method, Species Level Identification of Microorganisms from Metagenomes (SLIMM), which addresses these issues by using coverage information of reference genomes to remove unlikely genomes from the analysis and subsequently gain more uniquely mapping reads to assign at lower ranks of a taxonomic tree. SLIMM is based on a few, seemingly easy steps which lead to a tool that outperforms state-of-the-art tools in run-time and/or memory usage while being on par or better in computing quantitative and qualitative information at the species level.
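The coverage-based filtering idea described in the abstract can be illustrated with a minimal sketch. This is not SLIMM's actual implementation: the function names, the fixed-width binning, the bin width and the `min_fraction` threshold are all assumptions chosen for illustration. The sketch discards reference genomes whose reads cover only a small fraction of the genome and then recounts which reads map to exactly one of the remaining genomes.

```python
from collections import defaultdict


def covered_fraction(positions, genome_len, bin_width):
    """Fraction of fixed-width bins that contain at least one read start."""
    n_bins = -(-genome_len // bin_width)  # ceiling division
    covered = {pos // bin_width for pos in positions}
    return len(covered) / n_bins


def filter_and_reassign(alignments, genome_lengths, bin_width=100, min_fraction=0.5):
    """Drop poorly covered genomes, then collect uniquely mapping reads.

    alignments: dict mapping read id -> list of (genome id, start position).
    bin_width and min_fraction are hypothetical values, not taken from the paper.
    """
    # Gather all read start positions per reference genome.
    positions = defaultdict(list)
    for hits in alignments.values():
        for genome_id, pos in hits:
            positions[genome_id].append(pos)

    # Keep only genomes whose covered fraction reaches the threshold.
    kept = {
        g for g, pos in positions.items()
        if covered_fraction(pos, genome_lengths[g], bin_width) >= min_fraction
    }

    # A read is unique if it hits exactly one of the kept genomes.
    unique = {}
    for read_id, hits in alignments.items():
        targets = {g for g, _ in hits if g in kept}
        if len(targets) == 1:
            unique[read_id] = next(iter(targets))
    return kept, unique
```

In this toy setting, a read that originally hit two genomes becomes uniquely assignable once one of those genomes is removed for insufficient coverage, which is the effect the abstract attributes to the filtering step.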

Metagenomics Microbial Communities Microorganisms Taxonomic Profiling NGS Data Microbiology

This work is supported by the International Max Planck Research School for Computational Biology and Scientific Computing and by the InfectControl 2020 Project (TFP-TV4). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

This submission is intended for the GCB2016 Conference Collection.

Supplemental Information

10.7287/peerj.preprints.2378v1/supp-1

Supplemental Information 1 Supplementary table

Contains:

Details of datasets used for the study

Accuracy comparison of different methods per datasets

Runtime for each dataset

Statistical details (standard deviation, mean, variance, Q1, Q2 (median), Q3) of the difference between real and predicted abundance
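As a rough illustration of the statistics listed above, the following sketch summarises per-taxon differences between predicted and real abundances. It is an assumption-laden example rather than the procedure used to build the table: the function name is hypothetical, the difference is taken as predicted minus real, taxa missing from one profile are treated as having abundance 0, and sample (not population) standard deviation and variance are used.

```python
import statistics


def difference_summary(real, predicted):
    """Summarise predicted-minus-real abundance differences per taxon.

    real, predicted: dicts mapping taxon name -> relative abundance.
    Requires at least two taxa in total.
    """
    taxa = sorted(set(real) | set(predicted))
    # Assumption: a taxon absent from a profile has abundance 0.
    diffs = [predicted.get(t, 0.0) - real.get(t, 0.0) for t in taxa]
    q1, q2, q3 = statistics.quantiles(diffs, n=4)  # Python 3.8+
    return {
        "mean": statistics.mean(diffs),
        "stddev": statistics.stdev(diffs),      # sample standard deviation
        "variance": statistics.variance(diffs),  # sample variance
        "q1": q1,
        "median": q2,
        "q3": q3,
    }
```

A mean near zero with a small spread would indicate that predicted abundances are neither systematically over- nor underestimated.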

10.7287/peerj.preprints.2378v1/supp-2

Supplemental Information 2 Figure S1: Precision-Recall Curves: SLIMM vs Existing Methods

Precision-Recall Curves: SLIMM vs Existing Methods across 8 different datasets

10.7287/peerj.preprints.2378v1/supp-3

Supplemental Information 3 Figure S2: Precision-Recall Curves: Different SLIMM variants

Precision-Recall Curves of Different SLIMM variants across 8 different datasets

10.7287/peerj.preprints.2378v1/supp-4

Supplemental Information 4 Figure S3: Violin Plots of the difference between real and predicted abundances: SLIMM vs Existing Methods

Violin Plots of the difference between real and predicted abundances: SLIMM vs Existing Methods across 8 different datasets

10.7287/peerj.preprints.2378v1/supp-5

Supplemental Information 5 Figure S4: Violin Plots of the difference between real and predicted abundances: Different SLIMM variants

Violin Plots of the difference between real and predicted abundances: Different SLIMM variants across 8 different datasets

10.7287/peerj.preprints.2378v1/supp-6

Supplemental Information 6 Figure S5: Predicted vs real abundances: SLIMM vs Existing Methods

Predicted vs real abundances: SLIMM vs Existing Methods across 8 different datasets

10.7287/peerj.preprints.2378v1/supp-7

Supplemental Information 7 Figure S6: Predicted vs real abundances: Different SLIMM variants

Predicted vs real abundances: Different SLIMM variants across 8 different datasets

Additional Information

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Temesgen Hailemariam Dadi conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper.

Bernhard Renard wrote the paper, reviewed drafts of the paper.

Lothar H. Wieler wrote the paper, reviewed drafts of the paper.

Torsten Semmler wrote the paper, reviewed drafts of the paper.

Knut Reinert wrote the paper, prepared figures and/or tables, reviewed drafts of the paper.

Data Deposition

The following information was supplied regarding data availability:

GitHub URL: https://github.com/temehi/slimm