To increase transparency, PeerJ operates a system of 'optional signed reviews and history'. This takes two forms: (1) peer reviewers are encouraged, but not required, to provide their names (if they do so, then their profile page records the articles they have reviewed), and (2) authors are given the option of reproducing their entire peer review history alongside their published article (in which case the complete peer review process is provided, including revisions, rebuttal letters and editor decision letters).
The revised manuscript can be accepted for publication in PeerJ.
The authors have addressed my requests satisfactorily. The paper can now be accepted for publication in PeerJ.
Both reviewers have a number of questions and suggestions, which should be addressed in a revised version of the manuscript and explained in a rebuttal letter.
Results and discussion should be merged.
More details are needed in the methods description.
Some more quality assessment is required. Data should be made available.
The manuscript of He et al. describes the construction of a transcriptome assembly from Illumina RNAseq data of a tree species (Phoebe chekiangensis) endemic to China, and the use of this assembly for marker discovery. As such, the study constitutes an important resource for a non-model species. In general, I found the manuscript clearly written in most places, but I have some concerns regarding the completeness of the methods description, the structure of the manuscript and data availability. I would also like to see some more quality assessment of the assembly.
1 Data availability: the authors should make their raw reads available on the short read archive and deposit their assembly, predicted SNPs and SSRs, and transcript annotation under a permanent Digital Object Identifier (DOI), as issued for instance by DataDryad.
2 The “Discussion” section is not really a discussion, but presents new results, so the Results and Discussion should be merged into “Results and Discussion”.
3 I am not satisfied with the authors’ interpretation of the unannotated transcripts. The authors should map their RNAseq reads back to the final assembly to get a rough estimate of the expression level of the assembled transcripts. These results cannot be interpreted biologically (no replicates), but may give some insights into the nature of the unannotated transcripts. Probably the authors have already done the read mapping for SNP calling. This should be described in the Methods section.
4 Describe in the Methods how Fig. 5 was generated.
5 I can’t follow the argument in ll. 239-243. The rate of false positives and false negatives will always be “large” (without specifying a threshold, large is meaningless), regardless of whether one or more tools are used. I think the authors want to discuss the low concordance between SNP callers. To be on the safe side, the authors should use only SNPs that are in common between different pipelines.
6 More generally, a considerable fraction of the predicted “SNPs” are probably not nucleotide polymorphisms between orthologous regions in the parental haplotypes of a heterozygous individual, but alignment artifacts caused by misalignment of paralogs. The exact rate is difficult to determine, but the authors should discuss the issue because it may limit the usefulness of the predicted markers for genetic studies.
7 L. 109 “in-house perl scripts”: please make these available.
8 Specify the versions of the software you used (Trinity, l. 114; SOAPsnp, l. 127).
9 How were open reading frames predicted?
10 The use of “endemic” in ll. 208 and 214 is wrong. Every species is endemic to someplace, so using “endemic” without specifying a place is meaningless.
11 The sentence in ll. 223-224 is unclear. What is “heterogeneous chromosomal recombination”? Are you maybe referring to the issue of apparent “SNPs” caused by misalignment of paralogs? (see 6.)
12 The statement in ll. 219-220 is no longer true. It was so in 2002 when the cited reference came out, but nowadays the most efficient way of SNP discovery involves some kind of NGS, so please rephrase and cite a recent review on SNP mining with NGS data.
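The filtering suggested in point 5 could look like the following. This is a minimal Python sketch, assuming each caller's output has already been parsed into a set of (transcript ID, position, alternate base) tuples; the function name, tool labels and toy data are hypothetical and are not taken from the manuscript.

```python
from typing import Dict, Set, Tuple

# A SNP is identified by (transcript ID, 1-based position, alternate base).
Snp = Tuple[str, int, str]


def concordant_snps(calls_by_tool: Dict[str, Set[Snp]]) -> Set[Snp]:
    """Return only the SNPs reported by every tool in calls_by_tool."""
    if not calls_by_tool:
        return set()
    return set.intersection(*calls_by_tool.values())


# Hypothetical toy call sets, for illustration only.
calls = {
    "tool_a": {("t1", 10, "A"), ("t1", 55, "G"), ("t2", 7, "C")},
    "tool_b": {("t1", 10, "A"), ("t2", 7, "C"), ("t3", 3, "T")},
    "tool_c": {("t1", 10, "A"), ("t2", 7, "C")},
}
shared = concordant_snps(calls)
print(sorted(shared))
```

Keeping only the intersection trades sensitivity for confidence: calls made by a single pipeline are dropped, including any that are real.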
Although for the most part the writing is clear, there are several statements that lack clarity or appear to be inaccurate. For instance: 1) Line 50, "genetic mechanisms": do you really mean genetic mechanisms, or do you mean genomic studies in general? 2) Section 2.3 could use clarification. Base quality was calculated but not used? What exactly is meant by clean reads?
There are also a number of typos, for example on lines 77, 87 and 110, that need to be corrected.
The variant calling approach, which uses a single individual, needs more justification and qualification. Why do you expect P. chekiangensis to have high heterozygosity? What efforts were made to ensure that reported variants were not in fact the product of sequencing
How were the 100 SNPs that were targeted for validation selected?
The conclusion that the different SNP prediction software packages vary greatly in their outputs is not unexpected given the weak experimental design. Of the 25 SNPs that were validated by Sanger sequencing, how many were called by each SNP prediction software? If the SNPs chosen for confirmation were selected on the basis of being part of the concordant set (those SNPs called by all three pipelines), or if the set selected was in some other way a "high confidence set", then we must suspect that an even greater proportion than 0.75 (75/100) are false positives, and it must be clearly stated that although some of the variants reported may be real, the vast majority are expected to be false. Even the variants that were validated by Sanger sequencing could be false positives: two nearly identical sequences can be amplified in tandem by the same set of primers.
Although the largest transcript is reported to be 16300, the position (SNP locus) in Table 2 has numbers greater than 16300. If SNP locus is not the position in the transcript, then this information should be provided.
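The consistency check behind this comment can be automated. Below is a minimal Python sketch, assuming that SNP locus is meant to be a 1-based position within the transcript and that transcript lengths are available as a dictionary; the function name and the example values are hypothetical and are not taken from Table 2.

```python
from typing import Dict, List, Tuple


def out_of_range_snps(
    snps: List[Tuple[str, int]], transcript_lengths: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Return SNPs whose 1-based position falls outside their transcript."""
    flagged = []
    for transcript_id, position in snps:
        length = transcript_lengths.get(transcript_id)
        # Unknown transcripts are flagged too, since they cannot be checked.
        if length is None or not 1 <= position <= length:
            flagged.append((transcript_id, position))
    return flagged


# Hypothetical values for illustration only.
lengths = {"t1": 16300, "t2": 900}
snps = [("t1", 120), ("t1", 20000), ("t2", 900), ("t3", 5)]
print(out_of_range_snps(snps, lengths))
```

Any flagged entry would indicate either a different coordinate system (which should then be documented) or an error in the table.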
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.