Using machine learning to predict DNA read alignment quality

Jacob Porter

doi:10.7287/peerj.preprints.27428v1

Biocomplexity Institute, Virginia Tech, Blacksburg, Virginia, United States

Published: 2018-12-13
Accepted: 2018-12-13

Subject Areas: Bioinformatics, Data Science
Keywords: DNA read alignment, entropy, DUST, machine learning, DNA sequence complexity, bisulfite, run length, feature selection

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Porter J. 2018. Using machine learning to predict DNA read alignment quality. PeerJ Preprints 6:e27428v1 https://doi.org/10.7287/peerj.preprints.27428v1

Abstract

An empirical understanding of how DNA read features affect read mapping and alignment quality could be useful in designing better read mapping and alignment software, read trimmers, and sequence masks. Many programs appear to use arbitrarily chosen features that are putatively relevant to DNA alignment quality. Machine learning gives a ready way to empirically assess a variety of features and rank them according to their importance. Sequence complexity features such as run length distribution, DUST, and entropy, together with quality measures from the DNA read data, were used to predict read mapping quality on Ion Torrent and Illumina data sets using both bisulfite-treated and untreated short DNA reads. Surprisingly, run length distribution mean and variance did as well as or better than DUST and entropy, even though several programs use DUST and entropy. Predictive accuracy of the models, measured by F1-score, ranged from 0.5 to 0.95; thus, the feature set is useful for understanding alignment quality.
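
To make the sequence complexity features named in the abstract concrete, the following is a minimal sketch of how run length mean and variance, base-composition entropy, and a DUST-style score might be computed for a single read. The function names and the exact definitions are assumptions for illustration (population variance, entropy in bits over single bases, and a common triplet-based DUST formulation); the abstract does not specify the definitions the author used.

```python
import math
from collections import Counter
from itertools import groupby


def run_lengths(seq):
    """Lengths of maximal runs of identical bases, e.g. 'AAACC' -> [3, 2]."""
    return [sum(1 for _ in group) for _, group in groupby(seq)]


def run_length_mean_var(seq):
    """Mean and population variance of the run lengths of a non-empty read."""
    runs = run_lengths(seq)
    mean = sum(runs) / len(runs)
    var = sum((r - mean) ** 2 for r in runs) / len(runs)
    return mean, var


def shannon_entropy(seq):
    """Shannon entropy, in bits, of the base composition of a non-empty read."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def dust_score(seq):
    """DUST-style score (assumed formulation): sum of c*(c-1)/2 over
    triplet counts c, divided by (number of triplets - 1)."""
    triplets = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(triplets.values())
    if total < 2:
        return 0.0  # too short to contain repeated triplets
    return sum(c * (c - 1) / 2 for c in triplets.values()) / (total - 1)
```

Under these definitions, a homopolymer such as `"AAAAAA"` has a single run, zero entropy, and a high DUST-style score, whereas `"ACGT"` has four runs of length one and the maximum entropy of 2 bits. Feature vectors built this way could then be passed to any classifier that reports feature importances.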

Author Comment

This is a submission to PeerJ Computer Science for review.
