FOCUS: an alignment-free model to identify organisms in metagenomes using non-negative least squares

View article
PeerJ

Introduction

Microbes are more abundant than any other cellular organism (Whitman, Coleman & Wiebe, 1998), and it is important to understand which organisms are present and what they are doing (Handelsman, 2004). In many environments a majority of the microbial community members cannot be cultured, and metagenomics is a powerful tool to directly probe uncultured genomes and understand the diversity of microbial communities by using only their DNA (Sharon & Banfield, 2013).

Understanding microbial communities is important in many areas of biology. For example, metagenomes can distinguish taxonomic and functional signatures of microbes associated with marine animals (Trindade-Silva et al., 2012) or disease states (Belda-Ferre et al., 2012). Large sequencing volumes, short read lengths, and sequencing errors make the task of identifying the diversity of organisms present in metagenomes challenging (Mande, Mohammed & Ghosh, 2012). Many programs exist for this and they are either homology- or composition-based.

Homology-based programs normally use the BLAST program (Altschul et al., 1997) to identify the best hit in a large database output. In MG-RAST (Meyer et al., 2008) sequences are aligned to a set of databases in order to classify the metagenomic sample. MetaPhlAn (Segata et al., 2012) and GenomePeek (K McNair, R Edwards, unpublished data) use a reduced database containing only marker genes, e.g., unique clades and housekeeping genes, allowing the BLAST search to be fast. PhymmBL (Brady & Salzberg, 2011) improves the BLAST results using interpolated Markov models. GASiC (Lindner & Renard, 2013) uses Bowtie (Langmead et al., 2009) and the reference genomes similarities to correct the observed abundance estimated. Parallel-Meta (Su, Xu & Ning, 2012) a fast program, which requires a GPU, uses megaBLAST (Zhang et al., 2000) and HMM (Hidden Markov Model) to improve the homology result. Most of these applications classify sequences individually, and generate a taxonomic profile by summing the bins.

In general, composition-based approaches use oligonucleotide (k-mer) frequencies. Taxy (Meinicke, Aßhauer & Lingner, 2011) uses oligonucleotide distributions in metagenomes and in reference genomes and uses mixture modeling to identify the organisms present in the metagenome, and RAIphy (Nalbantoglu et al., 2011) identifies organisms using oligonucleotides and relative abundance index.

We developed a new approach that reconstructs a taxonomic profile using an ensemble k-mer composition of the entire metagenome. We compute the optimal set of organism abundances using non-negative least squares (NNLS) to match the metagenome k-mer composition to organisms in a reference database. k-mers have previously been used to cluster unknown sequences (Teeling et al., 2004; McHardy et al., 2007) and NNLS has been used to identify the genera present in metagenomic samples based on variations in gene count (Carr, Shen-Orr & Borenstein, 2013). Here we combine these two approaches in FOCUS, an ultra fast, accurate, composition based approach to identify the taxa present in a metagenome. We compare the performance of FOCUS to GASiC, MetaPhlAn, RAIphy, PhymmBL, Taxy, and MG-RAST.

Methods

FOCUS workflow is described in Fig. 1. As in most composition-based approaches, a training set is pre-generated using the complete genomes information, and here the non-negative least squares (NNLS) is applied to compute the relative abundance of each organism in the database into the unknown data.

Workflow of the FOCUS program.

Figure 1: Workflow of the FOCUS program.

Reference dataset

FOCUS requires a group of reference genomes to model and identify the organisms present in a metagenome. 2,766 complete genomes were downloaded from the SEED servers (Aziz et al., 2012) on 20 December 2013 (see Table S1). k-mer frequencies (k = 6–8, default: k = 7) were calculated for both strands using Jellyfish 1.1.6 (Marçais & Kingsford, 2011), reducing the number of dimensions (Strous et al., 2012), and k-mer counts were normalized by the sum of frequencies. The user can also create their own training set, which is scalable to the quickly increasing number of available reference genomes because it also uses Jellyfish in the k-mer counting.

Simulated and real metagenomes

In order to evaluate FOCUS performance, a simulated dataset of short sequences (SimShort), containing 500,000 single 100 nt reads, was created using the supplied error model for Illumina GA IIx with TrueSeq SBS Kit v5-GA using GemSim (McElroy, Luciani & Thomas, 2012) (Table S2). The previous published high complexity simulated dataset (SimHC) from FAMeS (Mavromatis et al., 2007) was also used in the evaluation. Moreover, real metagenomic datasets were selected as test cases: one under healthy conditions, one under disease conditions (MG-RAST accession 4447943.3 and 4447192.3) (Belda-Ferre et al., 2012), one fecal sample from a healthy individual (MG-RAST accession 4440945.3) (Kurokawa et al., 2007), and three hundred datasets from the Human Micriobiome Project (HMP) (Consortium, 2012) (Table S3) were selected as a test case.

Non-negative least squares (NNLS)

The estimation of a parameterized model to understand some data is a fundamental problem in data modeling. Nevertheless, the estimation is not always easy, e.g., in problems like metagenome profiling that cannot have negative values for the fitted parameters. In such case, a solution can be estimated using NNLS, which is defined as:

Given a matrix A∈ℝmxn and a vector b∈ℝm, where mn, find a non-negative vector x∈ℝn to minimize the function (1). fx=1/2Axb2,wherex0andi=inxi=1.

In FOCUS, the reference matrix A is composed of m k-mer frequencies from n genomes, while a vector describing the user’s metagenomic dataset is calculated from the k-mer frequencies of both strands from the dataset using Jellyfish. FOCUS uses non-negative least squares to compute the set of k-mer frequencies x that explains the optimal possible abundance of k-mers in the user’s metagenome by selecting the optimal number of frequencies from the matrix A. We minimize the sum of squared differences (1) using the open source Scipy library (Jones, Oliphant & Peterson, 2001) which has a module for the NNLS algorithm which solves the KKT (Karush–Kuhn–Tucker) conditions (Lawson & Hanson, 1974). We added Tikhonov regularization (Garda & Galias, 2012) to deal with genomes that have similar k-mer compositions.

Jackknife resampling of the data

We implemented a jackknife resampling strategy to assess the robustness of the results. 50% of the reads were randomly resampled 1000 ×, and the species frequencies recalculated. For each species, these 1000 frequencies were averaged and the standard deviation calculated to estimate the spread.

Web-based and graphical user interface version

As an alternative to the command line version of the program, we have created a user-friendly web version and a graphical user interface (GUI) for Microsoft Windows users. The web server and the GUI are available at http://edwards.sdsu.edu/FOCUS.

Results and Discussion

Evaluation and comparison with other tools

All tools were run using default parameters and their default reference database, either online (MG-RAST) or using one core on a server with 24 processors × 6 cores Intel(R) Xeon(R) CPU X5650 @ 2.67 GHz and 189 GB RAM. We only compared GASiC to the SimHC dataset which had the results previously published (Lindner & Renard, 2013). We tried to run the tool; however, it requires a large amount of storage during the process to save its output data.

For the real data, three hundred and three metagenomic datasets were selected. First, the metagenomic sample of the human oral cavity from diseased conditions was used. MetaPhlAn apparently over predicted the genera Veillonella due to the short genome, and Taxy did not predict Prevotella hits (see Fig. 2) as described in Belda-Ferre et al. (2012). FOCUS was able to profile the organisms in only 41 s. Taxy took about 45 s, MetaPhlAn took about 3 min, RAIphy took 52 min, MG-RAST took 3 days, and PhymmBL took 1 week and 6 days. Using random subsets for the oral metagenome, we tested the tools scalability and showed that FOCUS and Taxy profile metagenomes in constant time (see Fig. 3).

Genera-level taxonomy classification sorted by FOCUS prediction for the metagenome from a diseased human oral cavity using FOCUS, MetaPhlAn, MG-RAST, PhymnBL, RAIphy, Taxy, and FOCUS (mean).

Figure 2: Genera-level taxonomy classification sorted by FOCUS prediction for the metagenome from a diseased human oral cavity using FOCUS, MetaPhlAn, MG-RAST, PhymnBL, RAIphy, Taxy, and FOCUS (mean).

Error bars represent the standard deviation uncertainty in tested metagenome.
Scalability test using different sub-sets of the human oral cavity under disease metagenome using FOCUS, MetaPhlAn, MG-RAST, PhymnBL, RAIphy, Taxy.

Figure 3: Scalability test using different sub-sets of the human oral cavity under disease metagenome using FOCUS, MetaPhlAn, MG-RAST, PhymnBL, RAIphy, Taxy.

The oral metagenome from the healthy condition was used. MetaPhlAn possibly over predicted the genera Neisseria, and Taxy was not able to predict Rothia hits (see Fig. 4). FOCUS profiled the metagenome in only 35 s. Taxy took about 41 s, MetaPhlAn took about 2 min, RAIphy took 48 min, MG-RAST took 3 days, and PhymmBL took 9 days.

Genera-level taxonomy classification sorted by FOCUS prediction for the metagenome from a healthy human oral cavity using FOCUS, MetaPhlAn, MG-RAST, PhymnBL, RAIphy, Taxy, and FOCUS (mean).

Figure 4: Genera-level taxonomy classification sorted by FOCUS prediction for the metagenome from a healthy human oral cavity using FOCUS, MetaPhlAn, MG-RAST, PhymnBL, RAIphy, Taxy, and FOCUS (mean).

Error bars show the standard deviation for the real metagenome.

A fecal metagenome from a healthy individual was used. All the tools predicted that Bifidobacterium and Enterococcus were the two most abundant genera in the sample. However, RAIphy apparently under predicted the genera Bifidobacterium (see Fig. 5). For this small dataset, FOCUS profiled the metagenome in 35 s. Taxy took about 40 s, MetaPhlAn took only 30 min, RAIphy took about 4 min, MG-RAST took 3 days, and PhymmBL took 2 days and 14 h.

Genera-level taxonomy classification sorted by FOCUS prediction for the metagenome from a fecal metagenomic sample of a healthy human using FOCUS, MetaPhlAn, MG-RAST, PhymnBL, RAIphy, Taxy, and FOCUS (mean).

Figure 5: Genera-level taxonomy classification sorted by FOCUS prediction for the metagenome from a fecal metagenomic sample of a healthy human using FOCUS, MetaPhlAn, MG-RAST, PhymnBL, RAIphy, Taxy, and FOCUS (mean).

Error bars show the standard deviation for the real metagenome.

Three hundred metagenomic samples (254 GB total) from HMP were analyzed at all the taxonomy levels using FOCUS (Table S4) in about 1 h and 20 min and compared with the published results from MetaPhlAn’s paper (Segata et al., 2012) by calculating the Euclidean distance between the results (see Fig. 6). For most of the samples, FOCUS and MetaPhlAn have similar predictions at the genera level but vary at the species level. However, for some samples in the posterior fornix and most of the samples from the anterior nares there were differences at all levels which may reflect the additional genome sequencing of isolates from those passages that has occurred since 2012. Other tools were not included in the analysis due to the CPU processing time.

Heat-map representing the distance between the FOCUS and MetaPhlAn results for 300 metagenomes from the Human Microbiome Project across 15 body sites.

Figure 6: Heat-map representing the distance between the FOCUS and MetaPhlAn results for 300 metagenomes from the Human Microbiome Project across 15 body sites.

The distance was computed using the Euclidean distance between the results of both tools.

For the simulated data, we removed species from the reference dataset that are present in this dataset and tried to predict the genera present in the SimShort dataset. A major limitation of many of the approaches discussed here is that the underlying databases cannot be changed. Only FOCUS, RAIphy, GASiC, and PhymmBL allow the end user to change their reference database. FOCUS and PhymmBL best predicted the correct genera while RAIphy could not correctly predict their abundance (Fig. 7). FOCUS had the fastest performance (45 s); RAIphy took about 2 h, while PhymmBL took approximately 5 days. Figs. S1S5 show the same comparison for other taxonomy resolutions.

Genera-level taxonomy classification for the SimShort dataset using FOCUS, PhymnBL, RAIphy, and FOCUS (mean).

Figure 7: Genera-level taxonomy classification for the SimShort dataset using FOCUS, PhymnBL, RAIphy, and FOCUS (mean).

For the SimHC simulated metagenomes, the genera present in the dataset were deleted from the training dataset, and we evaluated the class-level prediction. The tested tools correctly predicted the classes, except that RAIphy over predicted the top two classes (see Fig. 8). Again, FOCUS was the fastest tool (30 s) in comparison to RAIphy, which took about 1 h and 50 min, and PhymmBL, which took about 4 days. See Figs. S6S8 for other taxonomy levels.

Class-level taxonomy classification for the SimHC dataset using FOCUS, PhymnBL, RAIphy, and FOCUS (mean).

Figure 8: Class-level taxonomy classification for the SimHC dataset using FOCUS, PhymnBL, RAIphy, and FOCUS (mean).

Furthermore, for the SimHC dataset, we ran all the previously used tools and the GASiC published results to evaluate the genera-level prediction. GASiC and PhymmBL had the best predictions, and FOCUS failed in the prediction of 4 minor genera probably because many organisms present in the SimHC dataset were not included in the FOCUS database (see Fig. 9). We did not compare the running time because we extracted the GASiC results from its paper; however, in the original paper it took 2 days and needed at least 500 GB of storage to analyze the SimHC simulated metagenome.

Genera-level taxonomy classification for the SimHC dataset using FOCUS, MetaPhlAn, MG-RAST, PhymnBL, RAIphy, Taxy, GASiC, and FOCUS (mean).

Figure 9: Genera-level taxonomy classification for the SimHC dataset using FOCUS, MetaPhlAn, MG-RAST, PhymnBL, RAIphy, Taxy, GASiC, and FOCUS (mean).

The very small standard deviations observed after jackknife re-sampling indicate the robustness of our results. Furthermore, in order to show a quantitative evaluation between the real and predicted abundance for the synthetic metagenomes, we computed the Euclidean distance between the real and predicted abundances for all the simulated data presented above (see Fig. 10). For some of the tools, only genus level predictions are available, but for RAIphy, PhymmBL, and FOCUS we included all taxonomic levels. The data demonstrate that FOCUS had the best prediction in more than half of test cases.

Numerical evaluation between the real and predicted abundance for the synthetic metagenomes computed by the Euclidean distance between the real and the predicted values.

Figure 10: Numerical evaluation between the real and predicted abundance for the synthetic metagenomes computed by the Euclidean distance between the real and the predicted values.

These tests were performed on a server; however, FOCUS is also ultra fast on a simple computer. For example, we profiled the real dataset in 1 min and 45 s using an Intel(R) Core(TM) i3 @2.53 GHz and 1 GB RAM. In addition to the Web server, we have developed a stand-alone version that runs on the Windows® platform.

Limitations

As with other methods created to profile metagenome sequences, FOCUS depends on a curated database of microbial reference genomes in order to predict a specific genus. If a reference genome is absent, the tool will predict the closest reference available.

Conclusions

Here we present FOCUS, an agile solution to identify the organisms present in metagenomic samples that does not rely on mapping individual reads, but instead determines the taxonomic composition of the entire metagenome at once by using NNLS. This makes FOCUS an extremely fast and scalable solution to profile the focal taxa in a metagenome. FOCUS reports very similar species compositions as currently available, state of the art metagenome profiling tools.

Availability and requirements

Project name: FOCUS

Project and web server home page: http://edwards.sdsu.edu/FOCUS

Operating system: the program has a command line version that works on OS X and Unix, and a GUI for Microsoft Windows users.

Programming language: Python 2.7.

Other requirements: Numpy (http://www.numpy.org), Scipy (http://scipy.org), Jellyfish (http://www.cbcb.umd.edu/software/jellyfish), and Python programming language (http://www.python.org).

License: GNU GPL3.

Any restrictions to use by non-academics: no special restrictions.

Supplemental information

Complete list of complete genomes present in the training set

DOI: 10.7717/peerj.425/supp-1

Complete list of organisms and abundances for the SimShort test set

DOI: 10.7717/peerj.425/supp-2

Complete list of three hundred metagenomes from the Human Micriobiome Project (HMP) used as test set

DOI: 10.7717/peerj.425/supp-3

FOCUS prediction for in all the levels for 300 metagenomes from the Human Microbiome Project

DOI: 10.7717/peerj.425/supp-4

Supplementary_Data with supplementary results

DOI: 10.7717/peerj.425/supp-5
80 Citations   Views   Downloads