AnnotationBustR: An R package to extract subsequences from GenBank annotations

Department of Ecology & Evolutionary Biology, University of Tennessee - Knoxville, Knoxville, Tennessee, United States
DOI
10.7287/peerj.preprints.2920v1
Subject Areas
Bioinformatics, Evolutionary Studies, Genetics, Taxonomy
Keywords
Sequence Data, GenBank, ACNUC, R Package, Subsequences, DNA Barcodes, Phylogenetics, mtDNA, cpDNA
Copyright
© 2017 Borstein et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Borstein SR, O'Meara BC. 2017. AnnotationBustR: An R package to extract subsequences from GenBank annotations. PeerJ Preprints 5:e2920v1

Abstract

Background. DNA sequences are pivotal for a wide array of research in biology. Large sequence databases, like GenBank, provide a valuable resource for using DNA sequences in large-scale analyses. However, many sequences on GenBank contain more than one gene or are portions of genomes. Inconsistencies in the way genes are annotated, and the numerous synonyms under which a single gene may be listed, pose major challenges for extracting large numbers of subsequences for comparative analysis across taxa. At present, there is no easy way to extract portions from multiple GenBank accessions based on annotations where gene names may vary extensively.

Results. The R package AnnotationBustR allows users to extract sequences based on GenBank annotations through the ACNUC retrieval system, given search terms of gene synonyms and accession numbers. AnnotationBustR extracts the portions of interest and writes them to a FASTA file for use in downstream research.

Conclusion. FASTA files of extracted subsequences and accession tables generated by AnnotationBustR allow users to quickly find and extract subsequences from GenBank accessions. These sequences can then be incorporated into various analyses, such as the construction of phylogenies to test a wide range of ecological and evolutionary hypotheses.
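The synonym-based extraction described in the Results can be illustrated with a minimal offline sketch. It is written in Python rather than R, does not query ACNUC or GenBank, and is not the package's implementation. The synonym table, the `Feature` tuple, the function names, the accession `TOY0001`, and the toy sequence are all hypothetical and exist only to show the idea: match each annotated feature against a set of synonyms, slice out the annotated coordinates, and emit FASTA records.

```python
from collections import namedtuple

# Hypothetical annotated feature: a gene label plus 1-based inclusive coordinates.
Feature = namedtuple("Feature", ["name", "start", "end", "complement"], defaults=[False])

# Hypothetical synonym table: canonical locus name -> labels seen in annotations.
SYNONYMS = {
    "COI": {"coi", "cox1", "co1", "cytochrome c oxidase subunit i"},
    "CYTB": {"cytb", "cob", "cytochrome b"},
}

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string (unambiguous bases only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_loci(accession, sequence, features, synonyms=SYNONYMS):
    """Return {locus: (fasta_header, subsequence)} for features matching a synonym.

    Only the first matching feature per locus is kept.
    """
    found = {}
    for feat in features:
        label = feat.name.strip().lower()
        for locus, names in synonyms.items():
            if label in names and locus not in found:
                # Convert 1-based inclusive coordinates to a Python slice.
                sub = sequence[feat.start - 1:feat.end]
                if feat.complement:
                    sub = reverse_complement(sub)
                found[locus] = (f"{accession}_{locus}", sub)
    return found


def to_fasta(entries):
    """Format (header, sequence) pairs as FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in entries)


# Toy record with two differently labelled genes, one on the complement strand.
toy_sequence = "AAACCCGGGTTTACGTACGT"
toy_features = [Feature("COX1", 4, 9), Feature("cob", 11, 14, complement=True)]
loci = extract_loci("TOY0001", toy_sequence, toy_features)
print(to_fasta(loci.values()))
```

Keeping the synonyms in a table separate from the extraction logic means a new spelling of a gene name only requires a new table entry, which is the kind of annotation inconsistency the abstract identifies as the main obstacle.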

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

R code, RData, and files used for performing timing trials of AnnotationBustR.

Commented R code, the associated RData file, the lists of accessions for chloroplast genomes, metazoan mitochondrial genomes, and metazoan rDNA used in the timing trials, and the timing output for extracting thirteen chloroplast coding sequences, thirteen mitochondrial coding sequences, and the three rRNA genes and two internal transcribed spacers of rDNA.

DOI: 10.7287/peerj.preprints.2920v1/supp-1