Parameters tuning boosts hyperSMURF predictions of rare deleterious non-coding genetic variants
- Published
- Accepted
- Subject Areas
- Bioinformatics, Computational Biology, Computational Science, Data Mining and Machine Learning
- Keywords
- non-coding deleterious variants, rare genetic diseases, Mendelian diseases, neutral variants
- Copyright
- © 2017 Petrini et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2017. Parameters tuning boosts hyperSMURF predictions of rare deleterious non-coding genetic variants. PeerJ Preprints 5:e3185v1 https://doi.org/10.7287/peerj.preprints.3185v1
Abstract
The regulatory code that determines whether and how a given genetic variant affects the function of a regulatory element remains poorly understood for most classes of regulatory variation. Indeed, the large majority of bioinformatics tools have been developed to predict the pathogenicity of genetic variants in coding sequences or conserved splice sites. Computational algorithms for predicting non-coding deleterious variants associated with rare genetic diseases face special challenges owing to the rarity of confirmed pathogenic mutations. In this context, classical machine learning methods are biased toward neutral variants, which constitute the large majority of genetic variation, and fail to detect the potentially deleterious variants, which constitute only a tiny minority of all known genetic variation. We recently proposed hyperSMURF (hyper-ensemble of SMOTE Undersampled Random Forests), an ensemble approach explicitly designed to deal with the huge imbalance between deleterious and neutral variants and able to significantly outperform state-of-the-art methods for the prediction of non-coding variants associated with Mendelian diseases. Despite its successful application to the detection of deleterious single nucleotide variants (SNVs) as well as small insertions or deletions (indels), hyperSMURF depends on several learning parameters that strongly influence its overall performance. In this work we show experimentally that tuning the hyperSMURF parameters significantly boosts the performance of the method, allowing rare SNVs associated with Mendelian diseases to be predicted with significantly better precision and recall.
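The bias toward neutral variants mentioned in the abstract can be illustrated with a small arithmetic sketch. The variant counts and the two classifiers below are hypothetical, chosen only for illustration; they are not data or results from this work. The sketch shows why accuracy is uninformative under extreme class imbalance, while precision and recall expose the difference.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Hypothetical data set (assumption): 100 deleterious variants
# hidden among 1,000,000 neutral ones.
N_DELETERIOUS = 100
N_NEUTRAL = 1_000_000

# A classifier that labels every variant as neutral never finds
# a deleterious variant, yet it is almost always "right".
always_neutral = precision_recall_accuracy(
    tp=0, fp=0, fn=N_DELETERIOUS, tn=N_NEUTRAL
)

# A hypothetical imbalance-aware classifier that recovers 60 of the
# 100 deleterious variants at the cost of 40 false positives.
imbalance_aware = precision_recall_accuracy(
    tp=60, fp=40, fn=40, tn=N_NEUTRAL - 40
)

for name, (p, r, a) in [("always neutral", always_neutral),
                        ("imbalance aware", imbalance_aware)]:
    print(f"{name:>15}: precision={p:.2f} recall={r:.2f} accuracy={a:.5f}")
```

Both classifiers reach an accuracy above 99.99%, so only precision and recall distinguish them, which is why these two measures are the natural ones for reporting results on rare deleterious variants.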
Author Comment
This is an abstract that has been accepted for the NETTAB 2017 Workshop.