Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies

Department of Medicine, University of Chicago, Chicago, Illinois, United States
Josephine Bay Paul Center, Marine Biological Laboratory, Woods Hole, Massachusetts, United States
DOI
10.7287/peerj.preprints.1695v1
Subject Areas
Bioinformatics, Genomics, Microbiology
Keywords
genomics, assembly, curation, visualization, contamination, anvi'o, HGT
Copyright
© 2016 Delmont et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Delmont TO, Eren AM. 2016. Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies. PeerJ PrePrints 4:e1695v1

Abstract

High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination by non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini using approaches routinely employed by microbial ecologists who reconstruct bacterial and archaeal genomes from metagenomic data. We created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds, we could identify and characterize multiple near-complete bacterial genomes, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and that most of these contaminants differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today’s microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.
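
As a rough illustration of how k-mer frequencies can help separate scaffolds of different origin, the following minimal Python sketch computes a normalized tetranucleotide profile for each scaffold and compares profiles by Euclidean distance. It is not the anvi'o implementation and not the workflow used in this study; the function names, the toy sequences, and the choice of distance metric are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=4):
    """Return normalized frequencies of all 4**k k-mers over the ACGT alphabet.

    k-mers containing other characters (e.g. N) are ignored.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length frequency vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical toy scaffolds (not real data): A and B share the same
# AT-rich repeat unit (B is a rotated copy), C is GC-rich.
unit_at = "ATATATTTAAATATATTAAATTTATATA"
unit_gc = "GCGCCGGCGGCCGCGCGGCCGGCGCCGC"
scaffolds = {
    "scaffold_A": unit_at * 20,
    "scaffold_B": (unit_at[5:] + unit_at[:5]) * 20,
    "scaffold_C": unit_gc * 20,
}

profiles = {name: kmer_profile(seq) for name, seq in scaffolds.items()}
d_ab = euclidean_distance(profiles["scaffold_A"], profiles["scaffold_B"])
d_ac = euclidean_distance(profiles["scaffold_A"], profiles["scaffold_C"])

print(f"A vs B: {d_ab:.4f}")
print(f"A vs C: {d_ac:.4f}")
```

In this toy case the two scaffolds with similar sequence composition end up much closer to each other than to the third. In practice, composition is only one signal; the abstract describes combining it with coverage values across libraries and with bacterial single-copy genes.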

Author Comment

This study is currently under peer review, and we wished to make our pre-print available to the community as it was sent to reviewers. However, had we not been committed to providing a pre-print identical to the version under peer review, we would have revised it to clarify one important point: we are not suggesting that the tardigrade genome from our curation of the Boothby et al. assembly represents a better genome than the genome curated by Koutsovoulos et al. The most likely explanation for the 47 Mbp size difference between the two genomes is the better-resolved repeat regions due to the inclusion of Moleculo reads in Boothby et al.'s analysis. As two research parasites, we have the utmost respect for all of the research groups who raised funds, performed experiments, generated data, and made them publicly available.