Large scale metagenomic projects aim to extract biodiversity knowledge between different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomic or functional assignment rely on a small subset of the sequences that can be associated with known organisms. On the other hand, de novo methods that compare the whole sequence content of samples do not scale up to the numerous and large datasets generated by such projects.

These limitations motivated the development of a new de novo, reference-free method, Simka, which computes a collection of standard ecological distances between many metagenomic datasets by replacing species counts with k-mer counts.

Experiments on public Human Microbiome Project datasets demonstrate that Simka captures the essential underlying biological structure. Simka was able to compute in a few hours both qualitative and quantitative ecological distances on hundreds of metagenomic samples (690 samples, 32 billion reads). We also demonstrate that analyzing metagenomes at the k-mer level yields distances that are well correlated with those obtained from read-based or taxonomy-based approaches.

It is estimated that only a fraction of 10^{−24}–10^{−22} of the total DNA on earth has been sequenced (

In this context, it is more practical to set aside species composition altogether and compare microbial communities directly through the sequence content of metagenomic read sets. This was first performed using Blast (

In 2012, the Compareads method introduced a scalable heuristic that estimates the number of similar reads shared between two datasets based on their shared k-mers.

All these reference-free methods share the use of k-mers as the elementary unit of comparison.

Even if the Commet and MetaFast approaches were designed to scale up to large metagenomic read sets, their use on data generated by large scale projects is turning into a bottleneck in terms of time and/or memory requirements. By contrast, Mash outperforms by far all other methods in terms of computational resource usage. However, this frugality comes at the expense of result quality and precision: the output distances and Jaccard indexes do not take relative abundance information into account and are not computed exactly, due to the MinHash sub-sampling of the k-mer content.

In this paper, we present Simka. Simka compares N metagenomic datasets simultaneously, based on their full k-mer content, and computes a collection of standard ecological distances.

The contributions of this manuscript are threefold. First, we propose a new method for efficiently counting k-mers across many large datasets simultaneously, the Multiple k-mer Counting (MKC).

The proposed algorithm enables the computation of a large collection of ecological distances in a single pass over the counted k-mers.

Given N read datasets S_1, S_2, …, S_i, …, S_N, the objective is to provide an N × N matrix D in which each entry D_{i,j} represents an ecological distance between datasets S_i and S_j. Such possible distances are listed in the tables below.

In the following, N_{Sj}(w) represents the number of times a k-mer w occurs in dataset S_j.

All quantitative distances can be expressed in terms of the abundances N_{Si}(w) and of the marginal totals N_{Si} = ∑_{w∈Si}N_{Si}(w).

Quantitative distances are expressed with N_{Si}(w), the abundance of k-mer w in dataset S_i, and N_{Si} = ∑_{w∈Si}N_{Si}(w):

Name | Definition
---|---
Chord | √(2(1 − ∑_{w}N_{Si}(w)N_{Sj}(w)∕√(∑_{w}N_{Si}(w)² ∑_{w}N_{Sj}(w)²)))
Hellinger | √(∑_{w}(√(N_{Si}(w)∕N_{Si}) − √(N_{Sj}(w)∕N_{Sj}))²)
Whittaker | ½∑_{w}|N_{Si}(w)∕N_{Si} − N_{Sj}(w)∕N_{Sj}|
Bray–Curtis | 1 − 2∑_{w}min(N_{Si}(w), N_{Sj}(w))∕(N_{Si} + N_{Sj})
Kulczynski | 1 − ½(∑_{w}min(N_{Si}(w), N_{Sj}(w))∕N_{Si} + ∑_{w}min(N_{Si}(w), N_{Sj}(w))∕N_{Sj})
Jensen–Shannon | √(½∑_{w}(p_i(w)ln(2p_i(w)∕(p_i(w)+p_j(w))) + p_j(w)ln(2p_j(w)∕(p_i(w)+p_j(w))))), with p_i(w) = N_{Si}(w)∕N_{Si}
Canberra | (1∕|S_i∪S_j|)∑_{w}|N_{Si}(w) − N_{Sj}(w)|∕(N_{Si}(w) + N_{Sj}(w))

Qualitative distances are expressed with a (number of distinct k-mers shared between S_i and S_j), b and c (number of distinct k-mers specific to S_i and to S_j, respectively), and with the abundance-based shared fractions U and V defined in the text:

Name | Definition
---|---
Chord/Hellinger | √(2(1 − a∕√((a+b)(a+c))))
Whittaker | presence/absence form of the quantitative Whittaker distance
Bray–Curtis/Sorensen | 1 − 2a∕(2a+b+c)
Kulczynski | 1 − ½(a∕(a+b) + a∕(a+c))
Ochiai | 1 − a∕√((a+b)(a+c))
Jaccard | 1 − a∕(a+b+c)
AB-Jaccard | 1 − UV∕(U+V−UV)
AB-Ochiai | 1 − √(UV)
AB-Sorensen | 1 − 2UV∕(U+V)

Actually, Simka does not require the full k-mer count matrix to be stored: distances are updated on the fly while abundance vectors are streamed.

The full count matrix would require M_{KC} = W_s ∗ (8 + 4N) bytes, with W_s the number of distinct solid k-mers and N the number of datasets (8 bytes to encode a k-mer and 4 bytes per count). For large projects, this matrix quickly exceeds the available memory.
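As an illustration, the memory formula above can be evaluated for a hypothetical project size (the figures below are examples, not measurements from this paper):

```python
# Size of the full k-mer count matrix, M_KC = W_s * (8 + 4N) bytes:
# 8 bytes per encoded k-mer plus 4 bytes per count for each of N datasets.
def matrix_size_bytes(distinct_kmers: int, n_datasets: int) -> int:
    return distinct_kmers * (8 + 4 * n_datasets)

# Hypothetical project: one billion distinct solid k-mers, 100 samples.
size = matrix_size_bytes(10**9, 100)
print(size / 10**12)  # 0.408 terabytes
```

Such a matrix would not fit in the memory of a standard machine, which motivates the streaming scheme described next.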

However, a careful look at the definition of the ecological distances shows that the full matrix never needs to be held at once: each distance is additive over k-mers, so abundance vectors can be processed one at a time and then discarded.

To sum up, instead of computing the complete count matrix, Simka streams one abundance vector per distinct k-mer and updates all distances on the fly.

The first step, the Multiple k-mer Counting (MKC), takes as input the N read datasets and outputs one abundance vector per distinct k-mer.

Starting from N datasets of reads, the aim is to generate abundance vectors that will feed the ecological distance computation step. This task is divided into two phases:

Sorting Count,

Merging Count.

Each dataset is first counted independently during the Sorting Count phase.

As the number of distinct k-mers can be huge, they are split into disk partitions: each partition P_i contains a specific subset of the k-mer space, and a given k-mer is always assigned to the same partition whatever the dataset it comes from.

(A) The sorting count process, represented by a blue arrow, counts each dataset independently. Each process outputs, for every partition, a sorted list of (k-mer, count) pairs.

The Sorting Count phase has a high parallelism potential. A first parallelism level is given by the independent counts of each dataset.

Furthermore, to limit disk bandwidth and avoid an I/O bottleneck, partitions are compressed. A dictionary-based approach, such as the one provided in zlib, is particularly efficient here since the k-mers of a partition are sorted and thus share long prefixes.

Here, the data partitioning introduced in the previous step is advantageously used to generate abundance vectors. The files associated with a given partition, one per dataset S_i, are taken as input of a merging process. These files contain sorted (k-mer, count) pairs, so a simple multiway merge suffices to gather, for each distinct k-mer, its abundance in every dataset, i.e., its abundance vector.

In that scheme, the merging processes of the different partitions are independent and can be run in parallel.
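The merging of sorted partition files can be sketched as a classical multiway merge. The snippet below is a simplified illustration on toy in-memory lists, not Simka's actual compressed binary partition format:

```python
import heapq

def tagged(stream, idx):
    """Tag each (k-mer, count) pair of a sorted stream with its dataset index."""
    for kmer, count in stream:
        yield kmer, idx, count

def merge_counts(streams):
    """streams: one sorted list of (k-mer, count) pairs per dataset.
    Yields (k-mer, abundance vector) for each distinct k-mer."""
    n = len(streams)
    current, vector = None, None
    for kmer, idx, count in heapq.merge(*(tagged(s, i) for i, s in enumerate(streams))):
        if kmer != current:
            if current is not None:
                yield current, vector
            current, vector = kmer, [0] * n
        vector[idx] = count
    if current is not None:
        yield current, vector

s1 = [("AAC", 3), ("ACG", 1)]   # sorted counts from dataset 1
s2 = [("ACG", 2), ("CGT", 5)]   # sorted counts from dataset 2
print(list(merge_counts([s1, s2])))
# [('AAC', [3, 0]), ('ACG', [1, 2]), ('CGT', [0, 5])]
```

Because the inputs are already sorted, the merge runs in a single pass and only one abundance vector lives in memory at a time.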

Distinct k-mers seen only once are likely to contain sequencing errors.

This filter is activated during the count process. Only k-mers whose abundance reaches a given threshold are kept; such k-mers are called solid.

Simka computes a collection of distances for all pairs of datasets. As detailed in the previous section, abundance vectors are used as input data. For the sake of simplicity, we first explain the computations of the Bray–Curtis distance. All other distances, presented later on, can be computed in the same way, with only small adaptations.

The Bray–Curtis distance is given by the following equation:

BrayCurtis(S_i, S_j) = 1 − 2∑_{w∈Si∩Sj}min(N_{Si}(w), N_{Sj}(w)) ∕ (∑_{w∈Si}N_{Si}(w) + ∑_{w∈Sj}N_{Sj}(w))

where N_{Si}(w) is the abundance of k-mer w in dataset S_i. We consider here that w ∈ S_i∩S_j if N_{Si}(w) > 0 and N_{Sj}(w) > 0.

The equation involves marginal (or dataset-specific) terms (i.e., ∑_{w∈Si}N_{Si}(w) for dataset S_i) acting as normalizing constants, and crossed terms that capture the (dis)similarity between datasets (i.e., ∑_{w∈Si∩Sj}min(N_{Si}(w), N_{Sj}(w)) for datasets S_i and S_j). Marginal and crossed terms are then combined to compute the final distance.

Algorithm 1 shows that it is straightforward to compute the distance matrix between the N datasets by streaming the abundance vectors produced by the MKC.

A matrix, denoted C_∩, of dimension N × N, stores the crossed terms. Each merging process maintains a local matrix C_∩part (step 3), in parallel. Each process iterates over its abundance vector stream (step 4). For each abundance vector, we loop over each possible pair of datasets (steps 5–6). The matrix C_∩part is updated (step 8) if the current k-mer is shared by datasets S_i and S_j (step 7). Since a distance matrix is symmetric with a null diagonal, we limit the computation to the upper triangular part of the matrix C_∩part. The current abundance vector is then released. Each process writes its matrix C_∩part to disk when its stream is done (step 9).

When all streams are done, the algorithm reads each written C_∩part and accumulates it into C_∩ (steps 10–11). The last loop (steps 13–16) computes the Bray–Curtis distance for each pair of datasets and fills the distance matrix reported by Simka.
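As a minimal sketch of this streaming scheme (sequential, without the per-partition parallelism, and with illustrative variable names rather than Simka's), the following computes the Bray–Curtis matrix from a stream of abundance vectors, accumulating crossed terms only for pairs of datasets that share the current k-mer:

```python
# Streaming Bray-Curtis: one pass over abundance vectors, constant memory
# beyond the N x N accumulators (crossed terms) and N marginal totals.
def bray_curtis_matrix(abundance_vectors, n):
    """abundance_vectors: iterable of length-n count vectors (one per k-mer)."""
    crossed = [[0] * n for _ in range(n)]   # sum of min counts per pair
    marginal = [0] * n                      # total k-mer count per dataset
    for vec in abundance_vectors:
        for i in range(n):
            marginal[i] += vec[i]
            if vec[i] == 0:
                continue
            for j in range(i + 1, n):       # upper triangular part only
                if vec[j] > 0:              # k-mer shared by datasets i and j
                    crossed[i][j] += min(vec[i], vec[j])
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1 - 2 * crossed[i][j] / (marginal[i] + marginal[j])
            dist[i][j] = dist[j][i] = d
    return dist

vectors = [[3, 0], [1, 2], [0, 5]]          # toy abundance vectors
print(bray_curtis_matrix(vectors, 2))
# distance(0,1) = 1 - 2*1/(4+7) = 9/11, about 0.818
```

The abundance vectors never need to coexist in memory, matching the streaming argument of the previous section.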

The number of abundance vectors streamed by the MKC is equal to W_s, which is also the total number of distinct solid k-mers. The overall time complexity of the distance computation is thus in O(W_s × N^{2}).

The distance introduced in the previous section generalizes as follows: each distance between S_i and S_j decomposes into crossed terms and marginal terms, where M_{Si} is a marginal (i.e., dataset-specific) term of dataset S_i, usually of size 1 (i.e., a scalar). In most distances, M_{Si} is simply the total number of k-mers in S_i. By contrast, the crossed terms depend on both N_{Si}(w) and N_{Sj}(w) (and potentially on M_{Si} and M_{Sj} as well). For instance, for the abundance-based Bray–Curtis distance, M_{Si} = ∑_{w∈Si}N_{Si}(w). All marginal terms M_{Si} are computed during the first step of the MKC, which counts the k-mers of each dataset.

Qualitative distances form a special case of ecological distances: they can all be expressed in terms of the quantities a (number of distinct k-mers shared between S_i and S_j), b and c (number of distinct k-mers specific to S_i and to S_j, respectively). Those distances easily fit in the previous framework, with crossed term a = ∑_{w∈Si∩Sj}1_{{NSi(w)NSj(w)>0}} and marginal terms M_{Si} = ∑_{w∈Si}1_{{NSi(w)>0}} = a + b and M_{Sj} = a + c.
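A sketch of the corresponding accumulation for two datasets, assuming the standard presence/absence forms of the Jaccard and Ochiai distances (toy code, not Simka's implementation):

```python
import math

# Qualitative distances need only a (shared distinct k-mers), b and c
# (distinct k-mers specific to each dataset), accumulated from the same
# abundance-vector stream as the quantitative distances.
def qualitative_distances(abundance_vectors):
    a = b = c = 0
    for n_i, n_j in abundance_vectors:
        if n_i > 0 and n_j > 0:
            a += 1                       # k-mer present in both datasets
        elif n_i > 0:
            b += 1                       # specific to the first dataset
        elif n_j > 0:
            c += 1                       # specific to the second dataset
    jaccard = 1 - a / (a + b + c)
    ochiai = 1 - a / math.sqrt((a + b) * (a + c))
    return jaccard, ochiai

# Four toy k-mers: two shared, one specific to each dataset.
print(qualitative_distances([(3, 0), (1, 2), (0, 5), (4, 4)]))
```

With a = 2, b = 1, c = 1, this yields a Jaccard distance of 0.5 and an Ochiai distance of 1 − 2/3.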

In the same vein, abundance-based (AB) distances replace the "hard" fraction of distinct k-mers of S_i shared with S_j by probabilistic "soft" ones: here, U denotes the probability that a k-mer of S_i is also found in S_j. Similarly, the "hard" fraction of k-mers of S_j shared with S_i is replaced by the "soft" probability V that a k-mer of S_j is also found in S_i. U and V are computed as N_{SiSj}∕N_{Si} and N_{SjSi}∕N_{Sj}, where N_{SiSj} = ∑_{w∈Si∩Sj}N_{Si}(w)1_{{NSj(w)>0}} and N_{Si} = ∑_{w∈Si}N_{Si}(w). N_{SiSj} corresponds to crossed terms and is asymmetric, i.e., N_{SiSj} ≠ N_{SjSi}. Intuitively, U is the abundance-weighted fraction of S_i also found in S_j and therefore gives more weight to abundant k-mers.
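These soft fractions can be accumulated from the same abundance-vector stream. Below, a toy two-dataset sketch; the final line uses Chao et al.'s abundance-based Jaccard form for illustration:

```python
# Soft shared fractions: U is the abundance-weighted fraction of dataset i
# also present in dataset j, and symmetrically for V. Note the asymmetry
# of the crossed terms n_ij and n_ji.
def soft_fractions(abundance_vectors):
    n_ij = n_ji = n_i = n_j = 0
    for a_i, a_j in abundance_vectors:
        n_i += a_i                       # marginal total of dataset i
        n_j += a_j                       # marginal total of dataset j
        if a_i > 0 and a_j > 0:
            n_ij += a_i                  # crossed term N_SiSj
            n_ji += a_j                  # crossed term N_SjSi
    return n_ij / n_i, n_ji / n_j

u, v = soft_fractions([(3, 0), (1, 2), (0, 5), (4, 4)])
ab_jaccard = 1 - (u * v) / (u + v - u * v)
print(u, v, ab_jaccard)  # u = 5/8, v = 6/11
```

Because abundant shared k-mers dominate n_ij and n_ji, rare k-mers (often sequencing noise) weigh little in U and V.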


Finally, note that the additive nature of the computed distances over k-mers allows the computation to be distributed across partitions, with an overall cost linear in W_s, the number of distinct solid k-mers.

Simka is based on the GATB library, a C++ library dedicated to the efficient handling of very large sets of k-mers.

Simka is usable on standard computers and has also been entirely parallelized for grid or cloud systems. It automatically splits the process into jobs according to the available number of nodes and cores. These jobs are sent to the job scheduling system, while the overall synchronization is performed at the data level.

Simka is an open source software, distributed under the GNU Affero GPL license, available for download at

First, Simka performances are evaluated in terms of computation time, memory footprint and disk usage, and compared to those of other state-of-the-art methods. Then, the Simka distances are evaluated with respect to other read-based and taxonomy-based distances, as well as to the known biological structure of the samples.

We conduct our numerical experiments on data from the Human Microbiome Project (HMP) (

The scalability of Simka was first evaluated on small subsets of the HMP project, where the number of compared samples varied from 2 to 40. When computing a simple distance, such as Bray–Curtis for instance, Simka running time shows a linear behavior with the number of compared samples, whereas it grows much faster for complex distances. Indeed, simple distances only need to be updated for pairs (S_i, S_j) such that N_{Si}(w) > 0 and N_{Sj}(w) > 0, whereas complex distances need to be updated for each pair such that N_{Si}(w) > 0 or N_{Sj}(w) > 0, entailing a lot more update operations. It is noteworthy that, among all distances listed above, the complex ones are those defined over the union of the k-mer sets rather than their intersection.

Each dataset is composed of two million reads. All tools were run on a machine equipped with a 2.50 GHz Intel E5-2640 CPU with 20 cores and 264 GB of memory. (A) and (B) CPU time with respect to the number of compared samples.

For the comparison with other state-of-the-art tools, namely Commet, MetaFast and Mash, we parameterized Simka to compute only the Bray–Curtis distance, since each of these tools computes only one such simple distance.

In summary, Simka and Mash seem to be the only tools able to deal with very large metagenomic datasets, such as the full HMP project.

Remarkably, on the full dataset of the HMP project (690 samples), the overall computation time of Simka is about 14 h with very low memory requirements (see the table below).

Simka was run on a machine equipped with a 2.50 GHz Intel E5-2640 CPU with 20 cores, 264 GB of memory, with

HMP: 690 samples, 3,727 GB, 2 × 16 billion paired reads

| | Without filter | With filter |
---|---|---
Number of k-mers | 2,471 × 10^{9} | 2,331 × 10^{9}
Number of distinct k-mers | 251 × 10^{9} | 111 × 10^{9}
Number of distinct solid k-mers | 95 × 10^{9} | 15 × 10^{9}
Memory (GB) | 62 | 62
Disk (GB) | 1,661 | 795
Total time (min) | 1,338 | 862
MKC-Count (min) | 758 | 573
MKC-Merge (min) | 148 | 77
Simple distances (min) | 432 | 212
Complex distances (min) | 8,957 | 4,160

These results were obtained with default parameters, namely filtering out low-abundance k-mers (the "With filter" column).

We evaluate the quality of the distances computed by Simka by answering two questions. First, are they similar to distances between read sets computed using other approaches? Second, do they recover the known biological structure of HMP samples? For the first evaluation, two types of other approaches are considered: de novo read-based comparison and reference-based taxonomic profiling.

In this section, we focus on comparing Simka k-mer-based distances to the read-based similarity measures computed by Commet.

Looking at the correlation with Commet is interesting because this tool uses a heuristic based on shared k-mers to approximate the number of similar reads between datasets.

Commet and Simka were both used with Commet default parameters.

Spearman correlation values are represented with respect to the k-mer size k.

Similarly, clear correlations were observed for the other tested distances.

These results demonstrate that read-based metrics can safely be replaced by k-mer-based ones, saving huge amounts of time when working on large metagenomic projects.

A traditional way of comparing metagenomic samples relies on so-called taxonomic distances, based on assigning sequences to taxa by mapping them to reference databases. To compare Simka to such a traditional reference-based method, we used the HMP gut samples, a well-studied dataset comprising 138 samples. The HMP consortium provides a quantitative taxonomic profile for each sample on its website. These profiles were obtained by mapping the reads to a reference genome catalog at 80% identity. From these profiles, we computed the Bray–Curtis distance, later used as a reference. The complete protocol to obtain taxonomic distances is given in

Simka distances show a strong correlation with this reference taxonomic distance.

On this density plot, each point represents one or several pairs of the gut samples. The

Interestingly, these Simka results are robust to the choice of the k-mer size.

We propose to visualize the structure of the HMP samples and see if Simka is able to reproduce known biological results. To easily visualize those structures, we used the Principal Coordinate Analysis (PCoA) (

PCoA of the samples is based on the quantitative Ochiai distance computed by Simka with

We conduct the same experiment on the 138 gut samples from the HMP project.

Distribution of the gut samples from the HMP project is shown in a PCoA of the Jensen–Shannon distance matrix. This distance matrix was computed by Simka with

In this article, we introduced Simka, a new method for computing a collection of ecological distances between numerous metagenomic datasets, based on multiset k-mer counting.

The distance computation has a time complexity in O(W_s × N^{2}), with W_s the number of distinct solid k-mers and N the number of compared datasets.

Since metagenomic projects are constantly growing, it is important to offer the possibility of adding new samples to a set for which distances are already computed, without restarting the whole computation from scratch. It is straightforward to adapt the MKC algorithm to such an operation, but the merging step and distance computation step have to be done again. However, since adding a new sample does not modify previously computed distances, it only requires computing a single row of the distance matrix; it can thus be achieved in linear time.

The motivation for computing a collection of distances rather than just one is twofold: different distances capture different features of the data (

A notable key point of our proposal is to estimate beta-diversity using k-mer composition rather than species composition, thus avoiding any dependency on reference databases.

There is nevertheless room for improving Simka distances. For instance, recently,

The authors warmly thank Olivier Jaillon and Thomas Vannier from Genoscope (CEA-IG-LAGE) and Stéphane Robin from INRA AgroParisTech for providing their technical, biological and statistical expertise, as well as feedback during the conception of this manuscript. We thank the GenOuest BioInformatics Platform that provided the computing resources necessary for benchmarking.

The authors declare there are no competing interests.

The following information was supplied regarding data availability:

Source code: