Thanks very much for this preprint. I think Anvi'o is a great way to visualise lots of different lines of information and evidence and I definitely want to use it in future genome projects.
As one of the authors of the bioRxiv preprint questioning the Boothby et al results and methods (Koutsovoulos et al 2015), I would like to clarify the following:
a. This paper should make it explicit somewhere that all the filtering / analyses on the Edinburgh assembly were on our initial, unoptimised, made-for-purposes-of-screening-only assembly (nHd.1.0). It does mention our final assembly (should preferably refer to it as version nHd.2.3 to avoid confusion) but doesn't show Anvi'o run on it and I don't think that is clear.
As it stands, the terms raw/curated/final/draft might confuse readers. They could inadvertently imply that the curated version of our raw assembly (nHd.1.0) is the same as our final version (nHd.2.3). The paper also describes nHd.2.3 as curated ("These authors subsequently curated a 135 Mbp draft genome"), even though ours was not just curated but reassembled from filtered reads.
b. This paper could also acknowledge somewhere that the final 182 Mbp curated UNC/Boothby et al assembly was not optimised. The Anvi'o process identifies which contigs are likely contaminants. But the remaining tardigrade-origin contigs are not optimised, by which we mean reassembled given more uniform coverage. Once the contaminating reads are removed from the data set, coverage-aware assemblers will do a better job, as they will not be dealing with very low coverage bacterial genomes messing up median/mode coverage estimates. Our nHd.2.3 assembly is thus not simply a filtered subset of nHd.1.0, but a substantially improved reassembly made after contaminating reads were removed.
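To illustrate the coverage point, here is a toy sketch. The contig lengths and coverages are invented for illustration and are not taken from either assembly, and the helper name `per_base_median_coverage` is hypothetical. It only shows how low-coverage contaminant contigs can pull a per-base median coverage estimate away from the value seen for the target genome alone; it does not model how any particular assembler estimates coverage.

```python
def per_base_median_coverage(contigs):
    """Median coverage over all bases, given (length_bp, mean_coverage) pairs."""
    ordered = sorted(contigs, key=lambda c: c[1])
    total = sum(length for length, _ in ordered)
    running = 0
    for length, cov in ordered:
        running += length
        # The first contig that takes us past half of all bases holds the median.
        if running * 2 >= total:
            return cov
    raise ValueError("no contigs given")


# Invented numbers: high-coverage target contigs plus low-coverage contaminants.
target = [(50_000, 100), (40_000, 95), (30_000, 105)]
contaminants = [(30_000, 5), (20_000, 8)]

mixed_median = per_base_median_coverage(target + contaminants)
filtered_median = per_base_median_coverage(target)

print(f"median coverage with contaminants:    {mixed_median}x")
print(f"median coverage after filtering them: {filtered_median}x")
```

In this toy case the shift is small, but it grows with the share of low-coverage contaminant sequence, which is why reassembling from filtered reads can differ from simply dropping contigs after assembly.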
The rest of the paper is clear, and shows the utility of Anvi'o as an excellent visualisation tool. I especially liked the insights into the bacterial genome that was common to both samples.
This is obviously not a comprehensive review, but these points also jumped out at me:
1. "By applying two-dimensional scatterplots on their own assembly results (which were also contaminated with bacterial sequences), Koutsovoulos et al. reported a curated draft genome of H. dujardini"
This sentence gives the impression that our final assembly was as contaminated as the UNC assembly. We state in our preprint that we think the remaining contamination is on the order of a few stray fragments. Also, I think this statement implies that this was the only test for contamination in our paper (so perhaps this paper could include a phrase to suggest that the 2d scatterplot/blobplot was one of many tests?)
2. "A larger draft genome for H. dujardini.... This finding is in agreement with Koutsovoulos et al.’s findings; however, our curated draft genome is 47 Mbp larger than the draft genome released by Koutsovoulos et al. The portion of scaffolds covered by RNA-Seq data suggests that the additional 47 Mbp still originate from the tardigrade genome. Thus, our selection is likely to be a more complete draft genome for H. dujardini than that of Koutsovoulos et al., most probably due to Boothby et al.’s inclusion of longer reads. Regardless, long reads considerably improved Boothby et al.’s assembly..."
I think this paper correctly points out that approx 70 Mbp of the Boothby et al 252 Mbp assembly is contaminating sequence (and includes >96% of the proposed HGT). However, it suggests that our final nHd.2.3 assembly is less complete, on the basis of size alone. I would argue that our 135 Mbp genome is MORE complete than any subset of the 252 Mbp Boothby et al genome because 92.8% of RNA-seq reads map to ours vs 89.5% of RNA-seq reads mapping to theirs (see Table 1 in our bioRxiv preprint).
It is possible that some of the contigs in the 182 Mbp curated Boothby et al assembly are longer and supersede contigs in the Edinburgh nHd.2.3 assembly. But we think our assembly is overall better "for our purpose" than the curated Boothby et al subset (our lab's goal is typically to discover what genes/gene families are present etc). Yes, we might have collapsed some repeats/haploid contigs. We don't claim we have the best assembly possible. We think it is a good assembly given short-read data. A reassembly with Boothby et al's longer reads, after filtering out reads from contaminating bacteria, would obviously be better. However, Boothby et al themselves say neither Moleculo nor PacBio improved their N50 much (it stayed around 15 kb, compared to approx 50 kb for nHd.2.3). In fact, they didn't use their PacBio data at all in the final assembly according to Supp Info.
Our best guess is that the 182 Mbp is an expanded genome (flow cytometry suggests 75-110 Mb by Goldstein's own estimates, by T Ryan Gregory's, and by our own flow cytometry estimates). The extra assembled sequence is most likely due to better resolved repeat regions and uncollapsed haplo contigs (probably because of longer reads, though we haven't checked). Ours is somewhat expanded too, but not as much.
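As a rough back-of-the-envelope check of the expansion argument, the sketch below compares the two assembly spans quoted above with the 75-110 Mb flow cytometry range. The function and variable names are hypothetical, and the ratios are simple arithmetic on the figures in this comment, not new measurements.

```python
# Figures quoted in this comment (Mbp / Mb treated as equivalent here).
FLOW_CYTOMETRY_MB = (75, 110)
ASSEMBLIES_MB = {
    "Boothby et al curated": 182,
    "Edinburgh nHd.2.3": 135,
}


def expansion_range(span_mb, estimate=FLOW_CYTOMETRY_MB):
    """Return (min, max) ratio of assembly span to the flow cytometry range."""
    low, high = estimate
    return span_mb / high, span_mb / low


for name, span in ASSEMBLIES_MB.items():
    smallest, largest = expansion_range(span)
    print(f"{name}: {span} Mbp is {smallest:.2f}x to {largest:.2f}x the estimate")
```

Both assemblies come out larger than the flow cytometry range, with the 182 Mbp subset further above it than the 135 Mbp nHd.2.3 assembly.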
3. "Although scatterplots can describe the organization of contigs in assembly results, they suffer from limited number of dimensions they can display, and their inability to depict complex supporting data that can improve the identification of individual genomes. These limitations are particularly problematic in sequencing projects covering multiple sequencing libraries, where displaying mapping results from each library can help detecting sources of contaminants. Despite their successful applications, two dimensional scatter plots limit researchers to the use of simple characteristics of the data that can be represented on an axis (such as GC-content). In contrast, clustering scaffolds, and overlaying multiple layers of independent information produce more comprehensive visualizations that display multiple aspects of the data."
Absolutely. Multiple-track based visualisations are fabulous for seeing lots of evidence at once for a particular region. However, I would suggest that 2d scatterplots with additional sequence identity info (i.e. blobplots) and Anvi'o are complementary rather than one being better than the other: one lets you see the whole picture quickly/intuitively, while the other lets you identify specific cases using lots of evidence. That's certainly how we plan to use Anvi'o in future. Thanks for a terrific, well-engineered tool.