To increase transparency, PeerJ operates a system of 'optional signed reviews and history'. This takes two forms: (1) peer reviewers are encouraged, but not required, to provide their names (if they do so, then their profile page records the articles they have reviewed), and (2) authors are given the option of reproducing their entire peer review history alongside their published article (in which case the complete peer review process is provided, including revisions, rebuttal letters and editor decision letters).
Thank you for thoroughly addressing the referees' comments in your revised submission. I am happy to accept it for publication.
I am satisfied with the replies of the authors and have no further comments.
The two reviewers are generally positive about your submission but require statistical analyses to support your conclusions and gauge the generalisability of your results. Reviewer #2's point on disentangling contamination from laterally transferred genes is pertinent and should be discussed and possibly even tested, but I appreciate that you might not be able to resolve it fully within the scope of this manuscript.
The article is well written and easy to understand.
The experimental design is very clear and describes primary research.
I am just wondering whether some of the alignment hits are truly significant or could be expected by chance. The authors clearly show that the hits should not be random. Still, some p-values would help.
The article deals with the problem of contaminating sequences, a topic that is currently not widely discussed but clearly has an impact. In this study, the authors show one way to screen for bacterial and viral contamination. However, it would also be nice to have an easy and fast way to screen for other contaminants (e.g., human, mouse). As far as I know, their method (Kraken) should also be able to do this. Below I list my comments:
1. Can you comment on whether restricting the database might bias the analysis? For example, microsatellites from human DNA might map to some viral forms. Is there a particular reason for searching only for viral and bacterial contamination, given that the search takes only a few seconds?
2. Could you comment briefly on the significance of some of your alignment matches? Clearly, one would not expect a 634 bp region to map to sheep. However, the alignments of some contigs might not be as unique, given that you align them against the entire set of NCBI sequences. This might become a particular problem if the contaminating organism has not been sequenced yet.
3. I would suggest extending the supplementary table with the p-value or some other indication of how significant those alignments were. This is currently not easy to judge.
4. I think it would be a nice idea (if possible) to provide the community with a pipeline that automatically screens their read or contig data for contamination. This should ultimately also include the human genome and other eukaryotes. Overall, you should have something like this ready.
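As a rough illustration of the kind of significance measure requested in comments 2 and 3, the sketch below converts an alignment E-value into a p-value using the standard Karlin-Altschul relationship P = 1 - exp(-E), and estimates an E-value from a bit score and the search-space size. The function names are hypothetical, the formula ignores the edge-effect length corrections that BLAST applies, and nothing here is taken from the authors' manuscript.

```python
import math


def evalue_to_pvalue(evalue: float) -> float:
    """Probability of at least one chance hit scoring this well: 1 - exp(-E).

    expm1 keeps precision for the very small E-values typical of strong hits.
    """
    return -math.expm1(-evalue)


def bit_score_to_evalue(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2^(-S').

    Assumption: raw lengths are used, without BLAST's effective-length
    correction, so values will differ slightly from BLAST's own output.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    e = bit_score_to_evalue(bit_score=60.0, query_len=634, db_len=10**9)
    print(f"E-value = {e:.3g}, p-value = {evalue_to_pvalue(e):.3g}")
```

For very small E-values the p-value is practically equal to the E-value, so reporting either in the supplementary table would let readers judge the alignments.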
The article is nicely written and easy to follow, although I find it frustratingly short (see below).
The description of the methods used is accurate and reproducible.
Although unlikely, the hypothesis that some of the apparent contaminant sequences detected by the authors may actually be horizontal gene transfers should be discussed and tested. The ability of bacteria to take up DNA from the environment and integrate it in their genome is well known, and there have been many reports in recent years of horizontally acquired genes in eukaryotic genomes (see http://dx.doi.org/10.1002/bies.201300095). To complement their k-mer analysis, the authors should look at other proxies (such as %GC and presence/absence of introns) for the origin of their putative contaminants: the genome of the bdelloid rotifer Adineta vaga, for instance, contains many genes that BLASTX and presumably Kraken would flag as bacterial contaminants, but which have introns, are syntenic with bona fide metazoan genes and have a %GC indistinguishable from the rest of the genome.
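The %GC proxy mentioned above could be checked along the following lines. This is a minimal sketch, assuming the host genome is available as a list of contig sequences: it computes the GC fraction of a candidate segment and expresses it as a z-score against the contigs' GC distribution. The function names are hypothetical, and a real analysis would use length-matched windows rather than whole contigs.

```python
from statistics import mean, stdev


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (N and other codes ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / acgt


def gc_zscore(candidate: str, background: list[str]) -> float:
    """Number of standard deviations by which the candidate's GC fraction
    deviates from the background sequences (needs at least two of them)."""
    values = [gc_fraction(s) for s in background]
    return (gc_fraction(candidate) - mean(values)) / stdev(values)


if __name__ == "__main__":
    # Toy sequences for illustration only, not real genomic data.
    host_contigs = ["AAAT", "ATGC", "GGCC"]
    print(gc_zscore("GGCC", host_contigs))
```

A segment whose z-score is close to zero has a %GC indistinguishable from the rest of the genome, which is compatible with horizontal transfer followed by amelioration; a large absolute z-score is more suggestive of contamination, although neither outcome is conclusive on its own.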
It is also a bit frustrating that the authors do not tell us anything about whether the problems they report are anecdotal or on the contrary very frequent among published genomes. I would therefore suggest that they include in their revised article a survey of a random selection of genomes analysed using the same approaches in order to come up with a quantitative estimate of the occurrence of contamination in genome databases.
Lines 117-119: the authors mention that among the 67 genome segments that did not match any strain related to N. gonorrhoeae, 5 matched the genomes of cow or sheep, but it is not clear from the text what the origin of the 62 other segments may be. Does the sentence "None of these segments matched any other microbial species" apply to the 5 segments or to the whole set of 67 segments? In the latter case, does it mean that the 62 other segments had no match at all in the databases?
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.