This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Exogenous sequence contamination presents a challenge in first-draft genomes because it can lead to non-contiguous, chimeric assembled sequences. These can mislead downstream analyses that rely on synteny, such as linkage-based analyses. Recently, the Mojave Desert Tortoise (Gopherus agassizii) draft genome was published as a resource to advance conservation efforts for this threatened species and to learn more about chelonian biology and evolution. Here, we illustrate the steps taken to improve the desert tortoise draft genome by removing contaminating sequences, actions that are typically carried out after the initial release of a draft genome assembly. We used information from NCBI's VecScreen output to remove intra-scaffold contamination and trim leading and trailing Ns. We then reordered and renamed scaffolds, and transferred the gene annotation onto the new assembly. Finally, we describe the tools developed for this pipeline, freely available on GitHub (https://github.com/thw17/G_agassizii_reference_update), which facilitate post-assembly processing of other draft genomes. The new gopAga1.1 genome has an N50 of 251 kb and an L50 of 2592 scaffolds, and its annotation retains the 17,201 of the original 20,172 genes that were unaffected by the scaffold processing.
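To make the processing steps and summary statistics concrete, the following Python sketch shows one way a contaminated interval could be excised from a scaffold, flanking Ns trimmed, and N50 and L50 computed from scaffold lengths. It is an illustration only and is not taken from the pipeline in the repository above: the function names `trim_ns`, `excise`, and `n50_l50` are hypothetical, the choice to split a scaffold at each contaminated span is an assumption, and the spans are assumed to be 0-based, half-open coordinates already converted from the contamination report.

```python
def trim_ns(seq):
    """Strip leading and trailing N (or n) characters from a sequence."""
    return seq.strip("Nn")


def excise(seq, spans):
    """Remove contaminated spans and return the remaining pieces.

    `spans` is a list of (start, end) pairs, assumed 0-based and half-open.
    Splitting the scaffold at each span is an assumption made for this
    sketch; each surviving piece is trimmed of flanking Ns.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start > cursor:
            pieces.append(seq[cursor:start])
        cursor = max(cursor, end)
    if cursor < len(seq):
        pieces.append(seq[cursor:])
    trimmed = [trim_ns(piece) for piece in pieces]
    return [piece for piece in trimmed if piece]


def n50_l50(lengths):
    """Return (N50, L50) for a collection of scaffold lengths.

    N50 is the length of the scaffold at which the running total, taken
    from longest to shortest, first reaches half the assembly size; L50
    is the number of scaffolds needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no scaffold lengths given")


# Toy scaffold: real sequence, an N gap, a contaminant (V), another N gap.
scaffold = "NNACGTACGTNNVVVVVVNNGGCCNN"
pieces = excise(scaffold, [(12, 18)])
n50, l50 = n50_l50([len(piece) for piece in pieces])
```

On this toy scaffold the contaminant is dropped and two N-free pieces remain; real scaffolds would be read from and written back to FASTA files, and the scaffolds would then be reordered and renamed as described above.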