Similarity thresholds used in short read assembly reduce the comparability of population histories across species

Michael G Harvey; Caroline Duffie Judy; Glenn F Seeholzer; James M Maley; Gary R Graves; Robb T Brumfield

doi:10.7287/peerj.preprints.864v1

Michael G Harvey^1, Caroline Duffie Judy^1,2, Glenn F Seeholzer^1, James M Maley^1,3, Gary R Graves^2,4, Robb T Brumfield^1

1 Museum of Natural Science and Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana, USA

2 Department of Vertebrate Zoology, MRC-116, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA

3 Moore Laboratory of Zoology, Occidental College, Los Angeles, California, USA

4 Center for Macroecology, Evolution and Climate, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark

Published: 2015-02-28
Accepted: 2015-02-28

Subject Areas: Bioinformatics, Evolutionary Studies, Genetics, Genomics, Molecular Biology
Keywords: sequence assembly, next-generation sequencing, bioinformatics, birds, non-model species

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Harvey MG, Judy CD, Seeholzer GF, Maley JM, Graves GR, Brumfield RT. 2015. Similarity thresholds used in short read assembly reduce the comparability of population histories across species. PeerJ PrePrints 3:e864v1 https://doi.org/10.7287/peerj.preprints.864v1

Abstract

Comparing inferences among datasets generated using short read sequencing may provide insight into the concerted effects of evolutionary processes across organisms, but comparisons are complicated by biases introduced during dataset assembly. Sequence similarity thresholds allow the de novo assembly of short reads into loci for analysis, but the resulting datasets are sensitive both to the similarity threshold used and to the variation naturally present in the organism under study. Stringent thresholds, as well as highly variable species, may result in datasets in which divergent alleles are lost or divided into separate loci ('over-splitting'), whereas liberal thresholds increase the risk of paralogous loci being combined into a single locus ('under-splitting'). Comparisons among datasets or species are therefore potentially biased if different similarity thresholds are applied or if the species differ in levels of genetic variation. We examine the impact of a range of similarity thresholds on the assembly of empirical short read datasets from populations of four different non-model bird lineages (species or species pairs) with different levels of genetic divergence. We find that, in all species, stringent similarity thresholds result in fewer alleles per locus than more liberal thresholds, which appears to be the result of high levels of over-splitting at stringent thresholds. The frequency of putative under-splitting, conversely, is low at all thresholds. Inferred genetic distances between individuals, gene tree depths, and estimates of the ancestral mutation-scaled effective population size (θ) differ depending upon the similarity threshold applied. Relative differences in inferences among species vary even when the same threshold is applied, and may differ dramatically when datasets assembled under different thresholds are compared. We suggest some best practices for assembling short read data to maximize comparability, such as using more liberal thresholds and examining the impact of different thresholds on each dataset.
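
The trade-off between over-splitting and under-splitting described above can be illustrated with a toy sketch. This is not the assembly software or pipeline used in the study; the function names (`identity`, `cluster`), the greedy seed-based clustering rule, and the three 20 bp sequences are all hypothetical and chosen only for illustration. Two alleles of one locus (`A1`, `A2`) share 90% identity, and a putative paralog (`B1`) shares 75% identity with `A1`.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs, threshold):
    """Greedy clustering (illustrative assumption, not the authors' method).

    Each sequence joins the first cluster whose seed (first member) it
    matches at or above `threshold` identity; otherwise it seeds a new one.
    Returns a list of clusters, each a list of sequence names.
    """
    clusters = []
    for name, seq in seqs:
        for members in clusters:
            seed_seq = members[0][1]
            if identity(seq, seed_seq) >= threshold:
                members.append((name, seq))
                break
        else:
            clusters.append([(name, seq)])
    return [[name for name, _ in members] for members in clusters]


# Hypothetical toy data: two alleles of one locus and one paralog.
toy_reads = [
    ("A1", "ACGTACGTACGTACGTACGT"),
    ("A2", "TCGTACGTACATACGTACGT"),  # 2 mismatches vs A1 (90% identity)
    ("B1", "AGGTAGGTAGGTAGGTAGGT"),  # 5 mismatches vs A1 (75% identity)
]

stringent = cluster(toy_reads, 0.95)  # alleles A1 and A2 are split apart
moderate = cluster(toy_reads, 0.85)   # alleles merge, paralog stays separate
liberal = cluster(toy_reads, 0.70)    # paralog is merged with the locus

for label, result in [("0.95", stringent), ("0.85", moderate), ("0.70", liberal)]:
    print(label, result)
```

In this toy case the stringent threshold places the two alleles in separate loci (over-splitting), while the most liberal threshold pulls the paralog into the same locus (under-splitting). Real assemblers use alignment-based similarity and more elaborate clustering, so the sketch only conveys the direction of the bias, not its magnitude.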
