To increase transparency, PeerJ operates a system of 'optional signed reviews and history'. This takes two forms: (1) peer reviewers are encouraged, but not required, to provide their names (if they do so, then their profile page records the articles they have reviewed), and (2) authors are given the option of reproducing their entire peer review history alongside their published article (in which case the complete peer review process is provided, including revisions, rebuttal letters and editor decision letters).
Thank you for making the changes!
Thanks for the revisions - everything looks good! Please fix the minor comments from both reviewers and I'll accept it without further review.
The addition of validated variant calls has greatly added to this manuscript.
Findings appear valid on this revision.
(1) Line 53: “…that, when given a reference genome, will generate…”
(2) Lines 221-222: “…within which…”; awkward, re-word
(3) Lines 240-242: reword “See supplemental data S1…”
(4) Line 278: “CFSAN SNP Pipeline”
(5) Lines 332-334: Based on these results, should there be a minimum depth of coverage to use this program? I agree, and appreciate, that future developments should include a depth-of-coverage threshold.
(6) Line 354: “and also”
(7) Line 355: reword the sentence “Such a dataset….”
(8) Some discussion/re-iteration regarding the lack of detection of insertions and deletions is warranted as the program was not designed specifically to detect them.
(9) Table 2, “False positives”. It may be worth noting the total number of SNPs, insertions and deletions created by CFSAN SNP mutator.
(10) Table 3 can be supplementary.
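As an illustration of the depth-of-coverage threshold raised in comment (5), here is a minimal sketch of how called positions might be split by read depth. It is not taken from the pipeline: the function name `split_by_depth`, the dictionary keys `pos` and `depth`, and the threshold value in the usage lines are all hypothetical.

```python
def split_by_depth(calls, min_depth):
    """Split variant calls into those meeting a minimum read depth and the rest.

    `calls` is a list of dicts with a "depth" key (an assumed, illustrative
    record layout, not the pipeline's own data structure).
    """
    kept, rejected = [], []
    for call in calls:
        if call["depth"] >= min_depth:
            kept.append(call)
        else:
            rejected.append(call)
    return kept, rejected


# Example usage with made-up values; the threshold of 8 is arbitrary.
example_calls = [
    {"pos": 10, "depth": 3},
    {"pos": 20, "depth": 12},
    {"pos": 35, "depth": 8},
]
kept_calls, rejected_calls = split_by_depth(example_calls, min_depth=8)
```

Returning the rejected calls as well as the kept ones makes it easy to report how many positions a given threshold would exclude.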
The authors have sufficiently addressed my previous concerns, and have written a new package to produce randomly mutated genomes.
Testing of the accuracy via randomly mutated genomes provides a measure of error, and the source of the error is further characterized. CFSAN SNP mutator is a tool others could use to assess their own pipelines.
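The general idea of validating against randomly mutated genomes can be sketched as follows. This is a simplified illustration under stated assumptions, not the CFSAN SNP Mutator implementation: it introduces substitutions only (no insertions or deletions), and the function names `mutate_sequence` and `score_calls` are hypothetical.

```python
import random


def mutate_sequence(reference, num_snps, seed=0):
    """Introduce num_snps random substitutions into a reference sequence.

    Returns the mutated sequence and a truth set {position: new_base}.
    Positions are 0-based. Substitutions only; indels are not simulated.
    """
    rng = random.Random(seed)
    positions = rng.sample(range(len(reference)), num_snps)
    bases = list(reference)
    truth = {}
    for pos in positions:
        # Choose a base that differs from the original one at this position.
        alternatives = [b for b in "ACGT" if b != bases[pos]]
        new_base = rng.choice(alternatives)
        bases[pos] = new_base
        truth[pos] = new_base
    return "".join(bases), truth


def score_calls(truth, called):
    """Compare called SNPs ({position: base}) against the known truth set.

    A call at a true position with the wrong base counts as both a false
    positive and a false negative.
    """
    true_pos = sum(1 for pos, base in called.items() if truth.get(pos) == base)
    false_pos = len(called) - true_pos
    false_neg = sum(1 for pos, base in truth.items() if called.get(pos) != base)
    return {"tp": true_pos, "fp": false_pos, "fn": false_neg}
```

In practice, reads would be simulated from the mutated sequence, run through the pipeline under test, and the resulting calls passed to `score_calls` alongside the truth set.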
No further comment.
Table 2, check spelling of False Positive, rendered "P seositives" in my copy.
Consider deleting "simple" from the description of CFSAN SNP Mutator.
Add the GitHub information and availability information for CFSAN SNP Mutator to the end of the manuscript.
Hello -- all of the concerns of both reviewers are good and need to be addressed. Validation of some sort (not necessarily against GATK) and an explanation of choice of tools are particularly needed. Thank you!
This manuscript describes a custom pipeline designed to take fastq files through the mapping and variant-calling phases with one program. The concept behind the pipeline is of high importance as the analysis of next-gen datasets is currently cumbersome and requires multiple pieces of software. Additionally, the pipeline would be useful as additional samples become available in order to standardize the overall analysis.
Although the basic steps of the SNP pipeline are well-outlined and rely, for the most part, on currently available software (Bowtie2, samtools, Varscan), some custom scripts are included and therefore a validation of this pipeline on a publicly available SNP dataset is required. Although it is understood that different variant callers have various strengths and weaknesses, without a "gold standard", there is really no evidence that this pipeline will give accurate variant calls.
How will this pipeline be updated to reflect newer versions of the software used?
See comments above.
Although implied, it needs to be explicitly stated that this pipeline will remain freely available.
The paper and the software package feel a bit disconnected. The software package is everything I would like to see in an academic pipeline: source code available on GitHub, installation and usage well documented on readthedocs.org, and test cases distributed with it. However, the paper describing the software does not contain enough information to help the reader decide whether or not to use it. As the pipeline uses currently available tools and the current landscape of short-read aligners and variant callers is large and in flux, not enough discussion is devoted to the rationale behind the use of those tools.
Quality trimming: an obvious omission from the pipeline is trimming of the raw fastq reads for a quality cutoff. While [recent studies](http://journal.frontiersin.org/article/10.3389/fgene.2014.00013/) have shown that most data may in fact be over-trimmed, the same study shows that performing no trimming leads to a large number of spurious k-mers. Was quality trimming omitted for the sake of speed? Was it found to have little effect on the final SNP pileup?
Explain the reasoning behind the "custom" parts of the pipeline. It is mentioned that Varscan's consensus caller is not used. Why not? What does this custom one do? I tried digging through the snppipeline.py and other code, but was unable to work out the logic in a timely fashion.
Explain the reasoning behind the choice of various tools. SAMtools and sra-toolkit are essentially without peer, but Bowtie2 is one of half a dozen commonly used short read aligners. Were any other tools (bwa-mem, novoalign, SOAP) tried? Is the emphasis on speed? The same applies to the choice of SNP callers. Were FreeBayes, SOAPsnp, or something else considered? Why or why not?
While it is not free for all users, the GATK variant calling pipeline is widely used. How does this pipeline compare, both in terms of accuracy and run time?
A possible use of the pipeline mentioned is phylogenetic analysis of foodborne outbreaks. To that end, the software seems to assume that small, single-chromosome organisms will be reference mapped. In the resulting snpma file, samples are ID'd by sample name only. It is possible to reconstruct which SNPs are in which position/chromosome using the snplist.txt file, but it is cumbersome.
The snpma file is referred to as a "SNP matrix", but its format appears to be a compressed multifasta .aln file containing only the positions that vary between samples. A search of the literature produced no standard file format for SNP matrices, but the most common layout has samples as rows and the called positions as columns labeled by position. As this is not the format of the snpma file, it should be described and explained. Also, an example phylogenetic treatment of the snpma test data in PHYLIP or some other package could be helpful.
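To make the distinction concrete, the following sketch converts a multifasta alignment of variant positions into the row-per-sample, column-per-position layout described above. It rests on assumptions: the input is taken to be plain multifasta text with one record per sample, and the position labels are supplied by the caller as a list, because the exact layout of snplist.txt is not described here. The function names `parse_fasta` and `to_snp_table` are hypothetical.

```python
def parse_fasta(text):
    """Parse multifasta text into an ordered {sample_name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {sample: "".join(parts) for sample, parts in records.items()}


def to_snp_table(fasta_text, position_labels):
    """Build a table with samples as rows and called positions as columns.

    `position_labels` must list one label per alignment column, in order
    (for example "chr1:1052"); how these labels are obtained is left to
    the caller and is an assumption of this sketch.
    """
    sequences = parse_fasta(fasta_text)
    for sample, seq in sequences.items():
        if len(seq) != len(position_labels):
            raise ValueError(
                f"{sample}: {len(seq)} bases but {len(position_labels)} positions"
            )
    header = ["sample"] + [str(label) for label in position_labels]
    rows = [[sample] + list(seq) for sample, seq in sequences.items()]
    return [header] + rows


# Example usage with made-up data.
example_fasta = ">sampleA\nACG\n>sampleB\nATG\n"
example_table = to_snp_table(example_fasta, ["chr1:10", "chr1:52", "chr2:7"])
for row in example_table:
    print("\t".join(row))
```

Labeling each column with both chromosome and position would also address the difficulty of tracing a SNP back to its location in multi-chromosome references.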
I installed and ran the software on both the test lambda data and 7 strains of C. neoformans (SRR1063249, SRR1063250, SRR1063251, SRR1063252, SRR1063259, SRR1063260, SRR1063281) that I am currently analyzing. The install process was relatively smooth, above average for a typical bioinformatics software package.
I may have misread the instructions, but it appears that the pipeline does not support multiple read files per sample. One of the reasons I only tested 7 C. neoformans data sets is that the rest have multiple read files per strain.
The name "SNP Pipeline" is too general. However, I am loath to suggest yet another clever name for a bioinformatics tool. The name on the GitHub and readthedocs pages is "CFSAN SNP Pipeline"; I would suggest calling it that in the paper so that people can find it.
Conclusion, line 132: "robust and accurate" is not supported by the manuscript as written. "robust and reproducible" is certainly true, however. Claims to accuracy require comparison to other workflows and pipelines, which are currently not presented.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.