Using machine learning to predict DNA read alignment quality
- Subject Areas
- Bioinformatics, Data Science
- Keywords
- DNA read alignment, entropy, DUST, machine learning, DNA sequence complexity, bisulfite, run length, feature selection
- Copyright
- © 2018 Porter
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2018. Using machine learning to predict DNA read alignment quality. PeerJ Preprints 6:e27428v1 https://doi.org/10.7287/peerj.preprints.27428v1
Abstract
An empirical understanding of how DNA read features affect read mapping and alignment quality could help in designing better read mapping and alignment software, read trimmers, and sequence masks. Many programs appear to use arbitrarily chosen features that are putatively relevant to DNA alignment quality. Machine learning offers a ready way to assess a variety of features empirically and rank them by importance. Sequence complexity features (run length distribution, DUST, and entropy) and quality measures from the DNA read data were used to predict read mapping quality on Ion Torrent and Illumina data sets using both bisulfite-treated and untreated short DNA reads. Surprisingly, the mean and variance of the run length distribution performed as well as or better than DUST and entropy, even though several programs use DUST and entropy. The models achieved F1-scores between 0.5 and 0.95; thus, the feature set is useful for understanding alignment quality.
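To make two of the sequence complexity features named in the abstract concrete, the following is a minimal Python sketch of how the run length mean and variance and the Shannon entropy of a single read might be computed. The function names, the use of population variance, and the choice of single-base (rather than k-mer) entropy are assumptions made for illustration; the abstract does not specify the exact definitions used in the study.

```python
from collections import Counter
from itertools import groupby
from math import log2
from statistics import mean, pvariance


def run_lengths(read):
    """Lengths of maximal runs of identical bases, e.g. 'AAACCG' -> [3, 2, 1]."""
    return [sum(1 for _ in group) for _, group in groupby(read)]


def run_length_stats(read):
    """Mean and population variance of the run lengths of a read.

    Assumption: population variance is used; the study may use a
    different estimator.
    """
    if not read:
        raise ValueError("read must be non-empty")
    lengths = run_lengths(read)
    return mean(lengths), pvariance(lengths)


def shannon_entropy(read):
    """Shannon entropy (in bits) of the single-base composition of a read.

    Assumption: entropy is taken over individual bases, not k-mers.
    """
    if not read:
        raise ValueError("read must be non-empty")
    n = len(read)
    counts = Counter(read)
    return -sum((c / n) * log2(c / n) for c in counts.values())
```

Under these definitions, a homopolymer read such as `AAAA` has a single long run and zero entropy, while a read with an even mix of all four bases has the maximum single-base entropy of 2 bits. Per-read values like these could then serve as input columns for a classifier that predicts mapping quality.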
Author Comment
This is a submission to PeerJ Computer Science for review.