Robust and automatic definition of microbiome states

Introduction

Materials and Methods

Automating the identification of granular, yet robust states

  1. Choosing the number of clusters k (from 2 to 10) with the highest average Silhouette width (SI) across all pairs of beta diversity measures, requiring that the score exceed the SI threshold (0.25), and

  2. Checking whether that k value also passes the Prediction Strength (PS) threshold (0.80) for robustness, or

  3. Confirming that those k clusters are stable according to the Jaccard similarity threshold (0.75) obtained from a bootstrapping process. A sketch of this selection logic is given after this list.
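A minimal sketch of the first step, written in R (the language of our implementation), is shown below. It is illustrative only, not the published code (which is available in the repository cited under Data Availability): it uses a simulated abundance table and one fixed pair of beta diversity measures (Bray-Curtis and Morisita-Horn via the vegan package) rather than scanning all pairs of measures, and PAM as the clustering algorithm.

```r
## Step 1 sketch: choose k (2..10) with the highest average Silhouette width
## over a pair of beta diversity distances, requiring SI > 0.25.
library(cluster)   # pam(), silhouette()
library(vegan)     # vegdist() beta diversity distances

set.seed(42)
## Toy OTU table: 60 samples x 20 taxa (stand-in for a real dataset)
otu <- matrix(rpois(60 * 20, lambda = 5), nrow = 60)

## Example pair of measures; the study considers e.g. JSD, rJSD, Morisita-Horn
d1 <- vegdist(otu, method = "bray")
d2 <- vegdist(otu, method = "horn")   # Morisita-Horn

avg_sil <- function(d, k) {
  cl <- pam(d, k, diss = TRUE)                       # PAM clustering on distances
  mean(silhouette(cl$clustering, d)[, "sil_width"])  # average Silhouette width
}

ks <- 2:10
si <- sapply(ks, function(k) mean(c(avg_sil(d1, k), avg_sil(d2, k))))
best_k <- ks[which.max(si)]

if (max(si) > 0.25) {
  cat("Candidate k =", best_k, "with average SI =", round(max(si), 3), "\n")
} else {
  cat("No k in 2..10 passes the SI threshold of 0.25\n")
}
```

Steps 2 and 3, which decide whether this candidate k is accepted, are sketched after the description of the clustering assessment scores below.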

Clustering approaches

Beta diversity metrics

Clustering assessment scores

  1. SI: average Silhouette width: first, we search for the number of clusters k with the best SI, with k limited to the range 2 to 10. This first step parallels the study of Gajer et al. (2012), which also attempts to computationally define microbiome states. Here, we compute the average of the SI values for each possible pair of distance measures and each k, selecting the combination with the highest average. The average SI must be greater than 0.25 in all selected measures, as this is the minimum threshold for “sensible” clusters according to Rousseeuw (1987). This score takes into account the similarity between samples in the same cluster and in the nearest cluster. If the selected pair of distance measures does not satisfy the robustness constraint (see below), the next best combination is checked. SI was chosen as the means of selecting the best number of clusters because it is a standard, widely used metric for evaluating the quality of a grouping produced by a clustering algorithm; because its utility is independent of the data source; and because it can be used in the absence of a gold standard, as will be common for novel microbiome datasets.

  2. PS: Prediction Strength: although Koren et al. (2013) indicate that cluster selection could be restricted to the SI score alone for small datasets, we include an alternative, PS, which is also used in Koren et al. (2013). Our method runs 100 repetitions in which the dataset is split into two halves and clustering is applied to both. We then search for a correspondence between the two groups of clusters, assigning each point from one half to its closest cluster in the other half, and vice versa. A pair of points that share a cluster in one half is considered well classified if both points are assigned to the same cluster in the other half. The score is the frequency of well-classified pairs. Tibshirani & Walther (2005) recommend that the selected number of clusters have a prediction strength score above 0.8.

  3. Jaccard similarity: though the computation of PS implies some form of bootstrapping (a resampling technique), our methodology includes an alternative, explicit step to verify the stability of the clusters selected on the basis of the previous scores. This bootstrapping consists of resampling with replacement: clustering is computed over the whole dataset and, in addition, over 100 resamples. Since the Jaccard score compares groups of elements, we compute the similarity of each original cluster with the clusters obtained from each resample; the resulting similarity score is the mean of the size of the intersection divided by the size of the union of the two sets of samples. Following the guidelines for the interpretation of Jaccard similarity (Hennig, 2007; Hennig, 2008), a stable clustering should yield a Jaccard similarity value of 0.75 or higher. Both robustness checks are sketched in the code after this list.
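As a concrete illustration of the two robustness checks, the sketch below computes both scores for a fixed candidate k against a single simulated distance matrix. It is a minimal sketch rather than the published implementation: the functions prediction_strength and jaccard_stability are placeholders written out here to make the procedures explicit, the data are simulated, and Bray-Curtis is used as a stand-in beta diversity measure. The fpc R package (prediction.strength, clusterboot) offers general-purpose routines for the same quantities.

```r
## Robustness checks for a candidate k, computed directly from a distance matrix.
library(cluster)   # pam()
library(vegan)     # vegdist()

set.seed(42)
otu <- matrix(rpois(60 * 20, lambda = 5), nrow = 60)  # toy OTU table
d   <- vegdist(otu, method = "bray")                  # stand-in beta diversity measure
k   <- 3                                              # candidate k from the SI step

## Prediction Strength: split the samples into halves, cluster each half, and
## check how often pairs co-clustered in one half are also assigned to a common
## cluster when classified by the other half's medoids.
prediction_strength <- function(d, k, M = 100) {
  dm <- as.matrix(d); n <- nrow(dm)
  mean(replicate(M, {
    idx    <- sample(n)
    halves <- list(idx[1:(n %/% 2)], idx[(n %/% 2 + 1):n])
    mean(sapply(1:2, function(h) {
      test <- halves[[h]]; train <- halves[[3 - h]]
      cl_test  <- pam(as.dist(dm[test,  test ]), k, diss = TRUE)
      cl_train <- pam(as.dist(dm[train, train]), k, diss = TRUE)
      medoids  <- train[cl_train$id.med]                        # training medoids
      assigned <- apply(dm[test, medoids, drop = FALSE], 1, which.min)
      ## worst-case (minimum) cluster, as in Tibshirani & Walther (2005)
      min(sapply(1:k, function(cc) {
        members <- which(cl_test$clustering == cc)
        if (length(members) < 2) return(1)
        prs <- combn(members, 2)
        mean(assigned[prs[1, ]] == assigned[prs[2, ]])
      }))
    }))
  }))
}

## Jaccard stability: re-cluster 100 bootstrap resamples and compare each
## original cluster with its best-matching cluster in the resample.
jaccard_stability <- function(d, k, B = 100) {
  dm <- as.matrix(d); n <- nrow(dm)
  full <- pam(as.dist(dm), k, diss = TRUE)$clustering
  mean(replicate(B, {
    idx  <- sample(n, replace = TRUE)
    boot <- pam(as.dist(dm[idx, idx]), k, diss = TRUE)$clustering
    mean(sapply(1:k, function(cc) {
      orig <- which(full[idx] == cc)          # original cluster cc within the resample
      if (length(orig) == 0) return(0)
      max(sapply(1:k, function(bb) {
        bc <- which(boot == bb)
        length(intersect(orig, bc)) / length(union(orig, bc))
      }))
    }))
  }))
}

ps  <- prediction_strength(d, k)
jac <- jaccard_stability(d, k)
cat(sprintf("k = %d: PS = %.3f, Jaccard = %.3f, accepted = %s\n",
            k, ps, jac, ps >= 0.80 || jac >= 0.75))
```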

Selected microbiome datasets

Results

Microbiome state determination including all taxa

  1. The first graph shows the results of the algorithm’s attempt to choose the most suitable number of clusters, k, for distinct beta diversity measures (i.e., distances among samples), scored by SI. The selected k value (from 2 to 10) must yield the highest average SI for the best pair of beta diversity measures. For example, in Fig. 1A, the number of clusters selected would be k = 3, since that is where the algorithm detects the highest SI score (0.602; using the PAM algorithm with the JSD metric). The other metrics also exceed the minimum threshold, with Morisita-Horn being the second-best metric, scoring above the threshold for strong clusters.

  2. The second graph tests whether the k value chosen by SI in the first graph is sufficiently robust, using the PS criterion of being greater than 0.8. The second graph of Fig. 1A shows that JSD and rJSD with PAM at k = 3 satisfy the robustness test, with PS scores of 0.950 and 0.935, respectively. At k = 4, however, the PS value falls below the acceptable threshold for all metrics, showing that, despite k = 4 being an acceptable number of clusters based on the SI, classification of the data into four clusters is not robust.

  3. Finally, the graph in the third column shows the stability of the selected k clusters, by testing whether the Jaccard similarity for the chosen diversity measures exceeds 0.75. The third graph of Fig. 1A confirms the stability criterion, with Jaccard = 0.986 for the previously suggested k = 3 clusters with JSD and, in this case, also for all remaining beta diversity metrics. The short snippet after this list works through these thresholds using the Fig. 1A values.
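As a quick worked check, the reported values can be run through the acceptance rule directly (the numbers below are those quoted above for JSD with PAM in Fig. 1A):

```r
## Fig. 1A example (JSD + PAM, k = 3): does the candidate k pass all criteria?
si  <- 0.602   # average Silhouette width at k = 3
ps  <- 0.950   # Prediction Strength at k = 3
jac <- 0.986   # bootstrap Jaccard similarity at k = 3
(si > 0.25) && (ps >= 0.80 || jac >= 0.75)   # TRUE: k = 3 is accepted
```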

Association of states with biological phenomenon

Determining the source of the robust state calls

Relationship to other definitions of microbiome “state”

Comparison to related approaches

Exemplar application—state sequence diagram

Discussion

Conclusions

Data Citation

Human gut microbiome (David et al., 2014):

Chick gut microbiome (Ballou et al., 2016):

Infant gut microbiome (La Rosa et al., 2014):

Lake microbiome (Dam et al., 2016):

Vaginal microbiome (Gajer et al., 2012):

  • OTU table and metadata: Gajer P. et al. Science Table S2 (2012)

  • Raw data: SRA SRA026073 (2012).

Left and right palm, and tongue microbiome (HMP) (Caporaso et al., 2011):

Vaginal microbiome—Community groups (Ravel et al., 2011):

Supplemental Information

Robust clustering evaluation using the HCLUST algorithm on 8 datasets including all taxa

DOI: 10.7717/peerj.6657/supp-1

State time series diagram for both subjects from the David et al. (2014) dataset

Samples sorted by time. Grey points represent no additional time point for that subject.

DOI: 10.7717/peerj.6657/supp-2

Phylum-level taxonomic distribution per identified state found in the La Rosa et al. (2014) dataset

DOI: 10.7717/peerj.6657/supp-3

Microbiome clusters from the chick gut dataset with varying filters and aggregations, represented as principal coordinates graphs

(A–C) Species-level data. (D–F) Genus-level aggregation.

DOI: 10.7717/peerj.6657/supp-4

Phylum-level taxonomic distribution per identified state found by our approach in subject A from the David et al. (2014) dataset

Samples sorted by collection day or grouped by clusters.

DOI: 10.7717/peerj.6657/supp-5

Phylum-level taxonomic distribution per identified state found by our approach in subject B from the David et al. (2014) dataset

Samples sorted by collection day or grouped by clusters.

DOI: 10.7717/peerj.6657/supp-6

Comparison between microbiome states defined in the current study and those defined in Gajer et al. (2012)

DOI: 10.7717/peerj.6657/supp-7

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Beatriz García-Jiménez conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.

Mark D. Wilkinson conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the paper, approved the final draft.

Data Availability

The following information was supplied regarding data availability:

Our algorithm, implemented in R, is freely available at GitHub: https://github.com/wilkinsonlab/robust-clustering-metagenomics.

Our output data files are available at Zenodo: 10.5281/zenodo.1485916, including a set of graphs and the <sample, state> annotations for downstream study, provided as a text file and an R phyloseq object.
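For downstream use, the annotations can be loaded along the following lines; the file and column names here are hypothetical placeholders, so consult the Zenodo record for the actual names.

```r
## Loading the per-sample state annotations (file/column names are placeholders;
## see the Zenodo record 10.5281/zenodo.1485916 for the actual names).
library(phyloseq)

ps <- readRDS("dataset_with_states.rds")    # phyloseq object with state labels
head(sample_data(ps))                       # sample metadata, including the state column

states <- read.delim("sample_states.txt")   # <sample, state> text file
table(states$state)                         # number of samples per state
```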

Funding

Mark D. Wilkinson is funded by the Ministerio de Economía y Competitividad grant number TIN2014-55993-RM, and by the Isaac Peral programme of UPM. Beatriz García-Jiménez is funded through an award from the Severo Ochoa programme of the CBGP UPM-INIA Severo Ochoa Center of Excellence, Madrid. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
