To increase transparency, PeerJ operates a system of 'optional signed reviews and history'. This takes two forms: (1) peer reviewers are encouraged, but not required, to provide their names (if they do so, then their profile page records the articles they have reviewed), and (2) authors are given the option of reproducing their entire peer review history alongside their published article (in which case the complete peer review process is provided, including revisions, rebuttal letters and editor decision letters).
Thank you for making the requested revisions!
Thank you for this! The reviewers generally supported publication, with a few caveats. If you can address these I will take another look at it quickly.
In particular, please address the concerns about normalization and about the strength of the conclusions for larger and more complex data sets (the HMP is hardly representative of environmental data sets), discuss scalability and I/O, and specifically address what considerations might be needed when analyzing high-diversity and low-coverage data sets.
In supplementary figure 1, why does the running time become constant after k=17? The running time should grow exponentially with the kmer size. My interpretation is that the authors only picked 1M reads from each of the 20 samples. As a result, the total number of kmers is about 20M, which is much smaller than 4^17. I think the plot is unfair.
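The saturation argument in this comment can be illustrated with a rough bound. The sketch below is not taken from the manuscript: the read length of 100 bp is an assumption (the review does not state it), and the function names are hypothetical. It compares the size of the kmer space, 4^k, with the largest number of kmers that a fixed read set can contain.

```python
# Rough illustration of why the number of distinct kmers saturates for a fixed read set.
N_SAMPLES = 20                 # from the review comment
READS_PER_SAMPLE = 1_000_000   # from the review comment
READ_LENGTH = 100              # ASSUMPTION: read length is not given in the review


def kmer_space(k):
    """Number of possible kmers over the alphabet {A, C, G, T}."""
    return 4 ** k


def max_kmers_in_data(k, n_reads=N_SAMPLES * READS_PER_SAMPLE, read_length=READ_LENGTH):
    """Upper bound on kmer occurrences: each read yields (read_length - k + 1) kmers."""
    return n_reads * max(read_length - k + 1, 0)


def distinct_kmer_bound(k):
    """Distinct kmers can exceed neither the kmer space nor the kmers present in the data."""
    return min(kmer_space(k), max_kmers_in_data(k))


if __name__ == "__main__":
    for k in (9, 13, 15, 16, 17, 21, 31):
        limit = "kmer space" if kmer_space(k) <= max_kmers_in_data(k) else "data size"
        print(f"k={k:2d}  4^k={kmer_space(k):.3e}  bound={distinct_kmer_bound(k):.3e}  limited by {limit}")
```

Under these assumptions the bound stops following 4^k at about k=16 and is set by the amount of data from then on, which is consistent with the reviewer's reading of the plateau. The actual crossover depends on the true read length and on how many kmers are repeated.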
On a related issue in the same figure: when the data set is fixed, why does the disk usage grow with the kmer size? Is it due to the difficulty of compression when there are more distinct kmers? Even so, I am surprised to see that the curve does not flatten out until k>=51.
I’m wondering whether the sample size has an impact on the correlation, i.e. when some samples have many more reads than others. Although some of the distances include a normalization factor, does it correct for differences in data set size?
For example, in the Bray-Curtis distance calculation, if one data set is just the other data set repeated 100 times, then the distance will be close to 1, whereas it should be 0.
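The example above can be checked numerically. The following is a minimal sketch using the standard abundance-based Bray-Curtis dissimilarity on hypothetical kmer counts. It is not Simka's implementation, and the relative-abundance normalization shown is only one possible correction, not necessarily the one the authors use.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b))."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 1.0 - 2.0 * shared / total


def to_relative(counts):
    """Convert raw counts to relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical kmer counts; sample_b is sample_a repeated 100 times.
sample_a = [5, 3, 2, 10]
sample_b = [100 * c for c in sample_a]

raw = bray_curtis(sample_a, sample_b)
normalized = bray_curtis(to_relative(sample_a), to_relative(sample_b))

if __name__ == "__main__":
    print(f"raw counts:          {raw:.4f}")
    print(f"relative abundances: {normalized:.4f}")
```

On raw counts the dissimilarity is 1 - 2/101, close to 1 as the comment predicts, while on relative abundances it is 0.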
In this paper, the authors implemented a highly parallel program, Simka, that counts the kmers from many metagenomic samples while computing ecological distances with the additive property between the samples. The authors carried out a thorough analysis showing that kmer analysis gives results similar to those of other, more complex analysis methods. Thus Simka can be applied to analyze large-scale metagenomic experiments.
The framework of Simka is solid. It is quite scalable with respect to time and memory footprint. However, Simka heavily uses disk and is not scalable with respect to disk usage.
In this manuscript, the authors report the development of a method to compare metagenomic datasets based on k-mer counting. Unlike some other tools, Simka can calculate not only the Bray-Curtis similarity but also many other similarity metrics, which is nice. In this method, the k-mer abundance profiles across the metagenomic datasets are calculated. However, by taking advantage of the additive nature of some similarity metrics, neither the k-mer abundance profiles nor the huge k-mer count matrix needs to be stored, which reduces disk usage. The authors benchmark Simka against other tools and compare the similarity measurements computed with Simka to those computed using other methods such as sequence alignment and taxonomic profiling. Generally, the manuscript is well written, with a comprehensive description of the methods, data and pipeline for reproducibility. The software package repository is well organized on GitHub and has good, clear documentation, which is very nice. I do have some comments on the manuscript, listed below.
1. One of the most challenging problems in using k-mer counting to compare metagenomic datasets is how to deal with k-mers that arise from sequencing errors. As the authors mention in line 196, many of the k-mers with very low abundance come from sequencing errors. The solution in this method is to filter out k-mers with an abundance of 1, keeping only the “solid k-mers”. This works well for metagenomic data sets with higher coverage, as shown in the manuscript with the HMP samples as the test dataset. It would be interesting to see how this method performs on other metagenomic datasets with lower coverage or higher diversity, such as some environmental data sets. The IMG/M datasets used in the COMMET paper and the Global Ocean Sampling datasets used in Compareads and Mash are two good candidates, since in this manuscript the authors compare the performance of Simka with COMMET and Mash. Also, in line 475, the authors mention “Simka is able to capture such subtle signal raises hope of drawing new interesting biological insights from the data, in particular for those metagenomics project lacking good references (soil, seawater for instance).” and in line 528, “However, species composition based approaches are not feasible for large read sets from complex ecosystems (soil, seawater) due to the lack of good references and/or mapping scaling limitations. Moreover, our proposal has the advantage of being a de novo approach, unbiased by reference banks inconsistency and incompleteness.” It would be great to see experimental results on such soil and seawater samples that support these points.
2. In various parts of the manuscript, the authors mention that the solid k-mer filter does not affect the results (lines 376, 442, 489). This may be due to the high coverage of the HMP data sets. In the discussion in lines 369-380, the authors mention that a small proportion (15%) of k-mers accounts for 95% of all base pairs of the whole dataset, which suggests that the HMP datasets have relatively high sequencing coverage and that most of the low-abundance k-mers filtered out are probably sequencing errors. This may explain why the Simka results are robust to filtering (line 441). Simply claiming that filtering out low-abundance k-mers does not affect the similarity measurements may not be accurate, at least until we see how this works for other environmental data sets with lower coverage, as recommended above. It would be nice if the authors could explain this more clearly.
3. Similarly, in lines 490-493, the authors mention that the filter can be disabled for samples with low coverage or where rare species have more impact. But in this situation, how should the large number of erroneous k-mers be handled? How will this affect performance? In lines 493-495, the authors claim that Simka is able to scale without the solid k-mer filter. This may be true for the HMP data shown in the manuscript, but it remains to be seen how it works for low-coverage data sets.
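The coverage concern raised in comments 1-3 can be made concrete with a simple model. The sketch below assumes that the number of times a genuine (error-free) k-mer is observed follows a Poisson distribution whose mean is the k-mer coverage. This is an idealization introduced here for illustration and is not taken from the manuscript, and the function name is hypothetical. It estimates what fraction of the genuine k-mers that are observed at all would be seen exactly once, and would therefore be discarded by a filter that removes k-mers with an abundance of 1.

```python
import math


def singleton_fraction(coverage):
    """Fraction of observed genuine k-mers seen exactly once.

    ASSUMPTION: k-mer counts are Poisson-distributed with mean `coverage`.
    P(count = 1) / P(count >= 1) = c * exp(-c) / (1 - exp(-c)).
    """
    p_one = coverage * math.exp(-coverage)
    p_observed = 1.0 - math.exp(-coverage)
    return p_one / p_observed


if __name__ == "__main__":
    for c in (0.5, 1, 2, 5, 20):
        print(f"coverage {c:>4}: {singleton_fraction(c):.2%} of observed genuine k-mers have abundance 1")
```

Under this model, at a coverage of 0.5 more than 70% of the observed genuine k-mers are singletons, whereas at a coverage of 20 the fraction is negligible. That is the regime difference between low-coverage environmental samples and the HMP data that these comments ask the authors to address.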
1. In the Abstract (Methods), “Simka scales-up today metagenomic projects thanks to a new parallel k-mer counting strategy on multiple datasets.” Should “today” be “today’s”?
2. Lines 3-6, “In large scale metagenomics studies such as Tara Oceans (Karsenti et al., 2011) or Human Microbiome Project (HMP) (Human Microbiome Project Consortium, 2012a) most of the sequenced data comes from unknown organisms and their short reads assembly remains an inaccessible task”. But in line 330, “One advantage of this dataset(HMP) is that it has been extensively studied, in particular the microbial communities are relatively well represented in reference databases”. These two descriptions of HMP seem to contradict each other.
3. Table 2, “2X16G paired reads”. It may be better to write simply “2 X 16 billion paired reads”. I am not sure that “G” can be used like this.
4. Line 433, “On the other hand, Mash distances correlate badly with taxonomic ones (r = 0.51, see the comparison protocol in Article S1).” It would be nice to cite Figure S3 here.
5. Line 133, “For example, experiments on the HMP (Human Microbiome Project Consortium, 2012a) datasets (690 datasets containing on average 45 millions of reads each) require a storage space of 630TB for the matrix KC.” How was the “630TB” calculated? For this method, the k-mer count matrix and k-mer frequency vectors do not need to be stored. But the frequencies of all the k-mers in each partition across the different data sets (red squares in Figure 2) are still stored on disk after the sorting count stage, right? If so, how does this differ from storing the k-mer count matrix?
6. What is the role of the GATB library in Simka? If GATB does the actual sorting count, then the paragraph in the manuscript about sorting count could be condensed. The description of the work of Chao et al. (2006) could also be more precise.
7. The software package repository is well organized and has good documentation. However, when I tried to run the example test with “./bin/simka -in example/simka_input.txt -out results -out-tmp temp_output”, it failed with “Illegal instruction: 4”. I was using simka-v1.3.0-bin-Darwin.tar.gz on Mac OS X 10.11. The problem may be on my side, but it may be useful for the authors to know.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.