DNA barcode data accurately identify higher taxa

doi:10.7287/peerj.preprints.1633v1

Jonathan A Coddington¹, Ingi Agnarsson², Ren-Chung Cheng³, Klemen Čandek³, Amy Driskell¹, Holger Frick⁴, Matjaž Gregorič³, Rok Kostanjšek⁵, Christian Kropf⁶, Matthew Kweskin¹, Tjaša Lokovšek³, Miha Pipan⁷, Nina Vidergar³, Matjaž Kuntner¹,³

1 National Museum of Natural History, Smithsonian Institution, Washington, D.C., United States

2 Department of Biology, University of Vermont, Burlington, Vermont, United States

3 EZ Lab, Institute of Biology, Research Centre of the Slovenian Academy of Sciences and Arts, Ljubljana, Slovenia

4 Department of Invertebrates, Natural History Museum Bern, Bern, Switzerland

5 Department of Biology, Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia

6 Natural History Museum Bern, Bern, Switzerland

7 Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom

Published: 2016-01-07
Accepted: 2016-01-07

Subject Areas: Biodiversity, Bioinformatics, Ecology, Genetics, Taxonomy
Keywords: taxonomic impediment, family, genus, Global Genome Initiative, genome

Copyright: © 2016 Coddington et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Coddington JA, Agnarsson I, Cheng R, Čandek K, Driskell A, Frick H, Gregorič M, Kostanjšek R, Kropf C, Kweskin M, Lokovšek T, Pipan M, Vidergar N, Kuntner M. 2016. DNA barcode data accurately identify higher taxa. PeerJ PrePrints 4:e1633v1 https://doi.org/10.7287/peerj.preprints.1633v1

Abstract

The use of unique DNA sequences for taxonomic identification is no longer fundamentally controversial, even though debate continues on the best markers, methods, and technology to use. Although existing databanks such as GenBank and BOLD, as well as reference taxonomies, are imperfect, in best-case scenarios “barcodes” (whether single or multiple loci, organelle or nuclear) are clearly an increasingly fast and inexpensive method of identification, especially compared to manual identification of unknowns by increasingly rare expert taxonomists. Because most species on Earth are undescribed, a complete reference database at the species level is impractical in the near term. The question therefore arises whether unidentified species can, using DNA barcodes, be accurately assigned to more inclusive groups such as genera and families, taxonomic ranks of putatively monophyletic groups for which the global inventory is more complete and stable. We used a carefully chosen test library of CO1 sequences from 49 families, 313 genera, and 816 species of spiders to assess the accuracy of genus- and family-level identifications. We used BLAST to query each sequence against the entire library and retained the top ten hits per query, yielding 8,160 hits. For each hit we recorded the percent sequence identity (PIdent, range 75–100%). Accurate identification (PIdent above which errors totaled less than 5%) occurred for genera at PIdent values > 95 and families at PIdent values ≥ 91, suggesting these as heuristic thresholds for generic and familial identifications in spiders. Accuracy of identification increases with the number of species per genus and genera per family in the library; above five genera per family and fifteen species per genus, all identifications were correct. We propose that percent sequence identity between conventional barcode sequences may be a feasible and reasonably accurate method to identify animals to family or genus. However, the quality of the underlying database affects the accuracy of results; many outliers in our dataset could be attributed to taxonomic and/or sequencing errors in BOLD and GenBank. It seems that an accurate and complete reference library of the families and genera of life could provide accurate higher-level taxonomic identifications cheaply and accessibly, within years rather than decades.
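
As an illustration of how the heuristic thresholds reported above might be applied, the following minimal Python sketch assigns a query to genus or family from the percent identity of its best hit. It is not the authors' pipeline: the function names, the `(pident, genus, family)` tuple layout, and the choice to use only the single best hit are assumptions made for this example, and the taxon names in the usage note are purely illustrative.

```python
from typing import List, Optional, Tuple

# Heuristic thresholds quoted in the abstract (spiders, CO1).
GENUS_MIN_PIDENT = 95.0   # genus-level identification: PIdent > 95
FAMILY_MIN_PIDENT = 91.0  # family-level identification: PIdent >= 91

# Hypothetical, simplified hit record: (pident, genus, family).
Hit = Tuple[float, str, str]


def assign_rank(pident: float) -> Optional[str]:
    """Return the lowest rank supported by a percent identity, or None."""
    if pident > GENUS_MIN_PIDENT:
        return "genus"
    if pident >= FAMILY_MIN_PIDENT:
        return "family"
    return None


def identify(query_hits: List[Hit]) -> Optional[Tuple[str, str]]:
    """Identify one query from its hits (assumption: best hit only).

    Returns (rank, taxon_name), or None when no hit reaches the
    family threshold or the query has no hits at all.
    """
    if not query_hits:
        return None
    # Take the hit with the highest percent identity.
    pident, genus, family = max(query_hits, key=lambda hit: hit[0])
    rank = assign_rank(pident)
    if rank == "genus":
        return ("genus", genus)
    if rank == "family":
        return ("family", family)
    return None
```

For example, `identify([(92.5, "Araneus", "Araneidae"), (88.0, "Tetragnatha", "Tetragnathidae")])` would return a family-level assignment to Araneidae, because the best hit clears the family threshold but not the genus threshold. Using only the best hit keeps the sketch short; a consensus over the top ten hits is an equally plausible design.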