To increase transparency, PeerJ operates a system of 'optional signed reviews and history'. This takes two forms: (1) peer reviewers are encouraged, but not required, to provide their names (if they do so, then their profile page records the articles they have reviewed), and (2) authors are given the option of reproducing their entire peer review history alongside their published article (in which case the complete peer review process is provided, including revisions, rebuttal letters and editor decision letters).
Thank you for submitting a fine MS to PeerJ. Both referees recommended acceptance with minor revision. Based on my reading of your paper, I also recommend acceptance, but I would like you to implement the corrections and suggestions of both reviewers. I also agree with Deren Eaton that your software has much broader applicability than metagenomics and, given its open nature, is likely to be used even more outside this field. It is worth mentioning this in your MS.
Congratulations on an excellent piece of software.
I really enjoyed reading this ms. The ms is very clearly written, with only minor edits suggested.
The authors should be congratulated on a very well designed program, which will be very valuable to the community. The validation is robust and I really can't find a major fault. I am really impressed with some of the new features e.g. format detection, filtering etc. The FASTQ file format conversion is especially welcome and overcomes a primary USEARCH issue related to FASTA files.
The methods are robust and very sound.
I wish to thank the authors for this ms. I think the program is impressive and will be really valuable to the community. There are really (very) minor language issues.
L253: to recreate datasets AS...
L271: SRC number? remove IS
L285: Italicise "fast_convert"
L340: version 7 have ...
The manuscript “VSEARCH: a versatile open source tool for metagenomics” by Rognes et al. introduces new software for analysis of genomic data that is freely available and distributed open source. It was developed as a drop-in replacement for the commonly used software usearch, which is and has been closed source. By making their software freely available and open-source the authors have exposed and clearly described the algorithms and underlying processes that are inherent in the computational functions of usearch which hundreds of biologists have been using for many years now. This is a major development that is of great benefit to the computational and bioinformatics community.
The manuscript appears to adhere to all PeerJ policies, it is clearly written, the analyses are rigorous, and the results are well explained.
The analyses are rigorous and well explained. All scripts and data to reproduce analyses are available online.
The number of tests that could be performed to compare vsearch and usearch is of course too great for the authors to be exhaustive. They do a good job of providing common use cases on typical data sets, and they compare accuracy and speed.
Below are some additional questions or recommendations:
In the section "Performance enhancement", the version numbers of usearch7 and usearch8 are given, but it is not clear whether the freely available 32-bit version or the paid 64-bit version was used. Whichever one was used, the authors should add a sentence about how the comparison results would vary if the other version (32-bit or 64-bit) were used instead.
Line 208: Provide reference for "simple center-star sequence alignment"
Line 396: I understand that the authors use the command argument name "cluster_smallmem" to maintain compatibility with usearch, but the name seems to be a misnomer, as I believe it just means that the sequences are processed in whatever order the user supplies. Perhaps you could add "(custom order)" after the command to make it clear that the argument name does not necessarily mean what it appears to.
Line 398: Add a short description of what the "Rand Index" is, and what "recall" is measuring, etc.
Figure 4: The clustering comparison does not use the "fulldp" option even though the authors explain that their implementation of this is better than that of usearch. I believe the vsearch implementation using fulldp is much faster, if I remember right (I use it in my own research). That would contradict the general statement that usearch is 2-3 times faster than vsearch at clustering (line 409).
On a related note, the authors describe their software in the title and introduction as being focused on metagenomics and microbiome research, but usearch/vsearch are in fact commonly used outside of this scope as well, such as for filtering, merging, and de novo clustering of short read sequences for population genetic and phylogenetic analyses of known sampled species (non-meta-sampling). Examples include the pyrad and ipyrad software, which use vsearch to identify homology within and between species in anonymous RAD-seq data. I'm not requesting that the authors cite any particular software, but it may be worth mentioning in the introduction that vsearch has many applications outside of metagenomics.
Line 442: Provide reference or link for Greengenes and SILVA
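To illustrate the kind of short description requested in the comment on Line 398, the following is a minimal sketch of the plain (unadjusted) Rand Index and of one common pairwise definition of recall. It assumes each sequence has a known reference label (for example a species) and an assigned cluster; the function names and toy data are hypothetical, and the manuscript may define or adjust these metrics differently.

```python
from itertools import combinations


def pair_counts(truth, predicted):
    """Count sequence pairs by whether they share a reference label and a cluster."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_truth and same_pred:
            tp += 1  # correctly grouped together
        elif same_pred:
            fp += 1  # grouped together but different reference labels
        elif same_truth:
            fn += 1  # same reference label but split across clusters
        else:
            tn += 1  # correctly kept apart
    return tp, fp, fn, tn


def rand_index(truth, predicted):
    """Fraction of all pairs on which the clustering agrees with the reference."""
    tp, fp, fn, tn = pair_counts(truth, predicted)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 1.0


def pairwise_recall(truth, predicted):
    """Fraction of same-label pairs that end up in the same cluster (assumed definition)."""
    tp, _, fn, _ = pair_counts(truth, predicted)
    return tp / (tp + fn) if (tp + fn) else 1.0


# Hypothetical toy data: five sequences, two reference species, two clusters.
truth = ["A", "A", "A", "B", "B"]
predicted = [1, 1, 2, 2, 2]
print(rand_index(truth, predicted))
print(pairwise_recall(truth, predicted))
```

Under these definitions, the Rand Index rewards both correct groupings and correct separations, whereas recall only asks whether sequences that belong together were kept together.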
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.